Abstract

This paper presents an investigation of the approximation property of neural networks with unbounded activation functions, such as the rectified linear unit (ReLU), which is the new de-facto standard of deep learning. The ReLU network can be analyzed by the ridgelet transform with respect to Lizorkin distributions. By showing three reconstruction formulas by using the Fourier slice theorem, the Radon transform, and Parseval's relation, it is shown that a neural network with unbounded activation functions still satisfies the universal approximation property. As an additional consequence, the ridgelet transform, or the backprojection filter in the Radon domain, is what the network learns after backpropagation. Subject to a constructive admissibility condition, the trained network can be obtained by simply discretizing the ridgelet transform, without backpropagation. Numerical examples not only support the consistency of the admissibility condition but also imply that some non-admissible cases result in low-pass filtering.


1 Introduction 

Consider approximating a function $f : \mathbb{R}^m \to \mathbb{C}$ by the neural network $g_J$ with an activation function $\eta : \mathbb{R} \to \mathbb{C}$

$$ g_J(x) = \frac{1}{J} \sum_{j}^{J} c_j \, \eta(a_j \cdot x - b_j), \qquad (a_j, b_j, c_j) \in \mathbb{R}^m \times \mathbb{R} \times \mathbb{C} \tag{1} $$

where we refer to $(a_j, b_j)$ as a hidden parameter and $c_j$ as an output parameter. Let $\mathbb{Y}^{m+1}$ denote the space of hidden parameters $\mathbb{R}^m \times \mathbb{R}$. The network $g_J$ can be obtained by discretizing the integral representation of the neural network

$$ g(x) = \int_{\mathbb{Y}^{m+1}} T(a,b) \, \eta(a \cdot x - b) \, d\mu(a,b), \tag{2} $$

where $T : \mathbb{Y}^{m+1} \to \mathbb{C}$ corresponds to a continuous version of the output parameter; $\mu$ denotes a measure on $\mathbb{Y}^{m+1}$. The right-hand side expression is known as the dual ridgelet transform of $T$ with respect to $\eta$

$$ \mathscr{R}^\dagger_\eta T(x) = \int_{\mathbb{Y}^{m+1}} T(a,b) \, \eta(a \cdot x - b) \, \frac{da \, db}{\|a\|}. \tag{3} $$

By substituting in $T(a,b)$ the ridgelet transform of $f$ with respect to $\psi$

$$ \mathscr{R}_\psi f(a,b) := \int_{\mathbb{R}^m} f(x) \, \overline{\psi(a \cdot x - b)} \, \|a\| \, dx, \tag{4} $$

under some good conditions, namely the admissibility of $(\psi, \eta)$ and some regularity of $f$, we can reconstruct $f$ by

$$ \mathscr{R}^\dagger_\eta \mathscr{R}_\psi f = f. \tag{5} $$

By discretizing the reconstruction formula, we can verify the approximation property of neural networks with the activation function $\eta$.
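As a concrete reading of the finite network in (1), the following minimal sketch evaluates $g_J$ for given parameters. It is only an illustration: the function name `network`, the choice of the ReLU for $\eta$, and the toy parameters are assumptions made here, not part of the original text.

```python
import numpy as np

def relu(z):
    """ReLU activation z_+ = max(z, 0)."""
    return np.maximum(z, 0.0)

def network(x, a, b, c, eta=relu):
    """Evaluate g_J(x) = (1/J) * sum_j c_j * eta(a_j . x - b_j).

    x: (n, m) inputs, a: (J, m) hidden weights,
    b: (J,) hidden biases, c: (J,) output parameters.
    """
    x = np.atleast_2d(x)
    pre = x @ a.T - b              # (n, J) array of a_j . x - b_j
    return eta(pre) @ c / len(c)   # average over the J hidden units

# Hypothetical toy parameters: two ReLU units that together represent |x|.
a = np.array([[1.0], [-1.0]])
b = np.array([0.0, 0.0])
c = np.array([2.0, 2.0])
x = np.array([[-3.0], [0.5]])
g = network(x, a, b, c)
```

With these toy parameters, $g_J(x) = \tfrac{1}{2}(2x_+ + 2(-x)_+) = |x|$, so the two evaluations return the absolute values of the inputs.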

*s. sonoda0110@toki.waseda.jp 
In this study, we investigate the approximation property of neural networks for the case in which $\eta$ is a Lizorkin distribution, by extensively constructing the ridgelet transform with respect to Lizorkin distributions. The Lizorkin distribution space $\mathcal{S}'_0$ is a space large enough to contain the rectified linear unit (ReLU) $z_+$, truncated power functions $z_+^k$, and other unbounded functions that have at most polynomial growth (but not polynomials as such). Table 1 and Figure 1 give some examples of Lizorkin distributions.


Table 1: Zoo of activation functions with which the corresponding neural network can approximate arbitrary functions in $L^1(\mathbb{R}^m)$ in the sense of pointwise convergence (§5.2) and in $L^2(\mathbb{R}^m)$ in the sense of mean convergence (§5.3). The third column indicates the space $\mathcal{W}(\mathbb{R})$ to which an activation function $\eta$ belongs (§6.1, §6.2).

| activation function | $\eta(z)$ | $\mathcal{W}$ |
|---|---|---|
| unbounded functions | | |
| truncated power function | $z_+^k := z^k$ if $z > 0$, $0$ if $z \le 0$; $k \in \mathbb{N}_0$ | $\mathcal{S}'_0$ |
| rectified linear unit (ReLU) | $z_+$ | $\mathcal{S}'_0$ |
| softplus function | $\sigma^{(-1)}(z) := \log(1 + e^z)$ | $\mathcal{O}_M$ |
| bounded but not integrable functions | | |
| unit step function | $z_+^0$ | $\mathcal{S}'_0$ |
| (standard) sigmoidal function | $\sigma(z) := (1 + e^{-z})^{-1}$ | $\mathcal{O}_M$ |
| hyperbolic tangent function | $\tanh(z)$ | $\mathcal{O}_M$ |
| bump functions | | |
| (Gaussian) radial basis function | $G(z) := (2\pi)^{-1/2} \exp(-z^2/2)$ | $\mathcal{S}$ |
| the first derivative of sigmoidal function | $\sigma'(z)$ | $\mathcal{S}$ |
| Dirac's $\delta$ | $\delta(z)$ | $\mathcal{S}'_0$ |
| oscillatory functions | | |
| the $k$th derivative of RBF | $G^{(k)}(z)$ | $\mathcal{S}$ |
| the $k$th derivative of sigmoidal function | $\sigma^{(k)}(z)$ | $\mathcal{S}$ |
| the $k$th derivative of Dirac's $\delta$ | $\delta^{(k)}(z)$ | $\mathcal{S}'_0$ |


Recall that the derivative of the ReLU $z_+$ is the step function $z_+^0$. Formally, the following suggestive formula

$$ \int_{\mathbb{Y}^{m+1}} T(a,b) \, \eta'(a \cdot x - b) \, \frac{da \, db}{\|a\|} = \int_{\mathbb{Y}^{m+1}} \partial_b T(a,b) \, \eta(a \cdot x - b) \, \frac{da \, db}{\|a\|}, \tag{6} $$

holds, because the integral representation is a convolution in $b$. This formula suggests that once we have $T_{\mathrm{step}}(a,b)$ for the step function, which is implicitly known to exist based on some of our preceding studies [?], then we can formally obtain $T_{\mathrm{ReLU}}(a,b)$ for the ReLU by differentiating: $T_{\mathrm{ReLU}}(a,b) = \partial_b T_{\mathrm{step}}(a,b)$.
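The integration by parts behind (6) can be checked numerically in the variable $b$ alone. The sketch below fixes $a = 1$ and uses a Gaussian as a stand-in for $T(a, \cdot)$; both choices, the grid, and the helper names `lhs` and `rhs` are assumptions made purely for illustration.

```python
import numpy as np

# Uniform grid in b; the Gaussian stand-in for T decays fast enough
# that truncating to [-10, 10] is harmless.
b = np.linspace(-10.0, 10.0, 200001)
h = b[1] - b[0]
T = np.exp(-b**2 / 2)          # hypothetical coefficient function T(b)
dT = -b * np.exp(-b**2 / 2)    # its derivative with respect to b

def lhs(x):
    """Riemann sum for int T(b) * step(x - b) db (step = derivative of ReLU)."""
    return np.sum(T * (x - b > 0)) * h

def rhs(x):
    """Riemann sum for int dT(b) * relu(x - b) db."""
    return np.sum(dT * np.maximum(x - b, 0.0)) * h

gaps = [abs(lhs(x) - rhs(x)) for x in (-1.0, 0.3, 2.0)]
```

Both sides agree up to the discretization error of the jump in the step function, which is of the order of the grid spacing.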

1.1 ReLU and Other Unbounded Activation Functions 

The ReLU [?] has become a new building block of deep neural networks, in place of traditional bounded activation functions such as the sigmoidal function and the radial basis function (RBF). Compared with traditional units, a neural network with the ReLU is said [?] to learn faster because it has larger gradients that can alleviate the vanishing gradient [3], and to perform more efficiently because it extracts sparser features. To date, these hypotheses have only been verified empirically, without analytical evaluation.

It is worth noting that in approximation theory, it was already shown in the 1990s that neural networks with such unbounded activation functions have the universal approximation property. To be precise, if the activation function is not a polynomial function, then the family of all neural networks is dense in some functional spaces such as $L^p(\mathbb{R}^m)$ and $C^0(\mathbb{R}^m)$. Mhaskar and Micchelli [?] seem to be the first to have shown such universality by using the B-spline. Later, Leshno et al. [?] reached a stronger claim by using functional analysis. Refer to Pinkus [12] for more details.

Figure 1: Zoo of activation functions: the Gaussian $G(z)$ (red), the first derivative $G'(z)$ (yellow), the second derivative $G''(z)$ (green); a truncated power function $z_+^2$ (blue), the ReLU $z_+$ (sky blue), the unit step function $z_+^0$ (rose).

In this study, we first work through the same statement by using harmonic analysis, namely the ridgelet transform. One strength is that our results are very constructive. Therefore, we can construct what the network will learn during backpropagation. Note that for bounded cases this idea is already implicit in [?] and [2], and explicit in [?].

1.2 Integral Representation of Neural Network and Ridgelet Transform 

We use the integral representation of neural networks introduced by Murata [1]. As already mentioned, the integral representation corresponds to the dual ridgelet transform. In addition, the ridgelet transform corresponds to the composite of a wavelet transform after the Radon transform. Therefore, neural networks have a profound connection with harmonic analysis and tomography.

As Kůrková [?] noted, the idea of discretizing integral transforms to obtain an approximation is very old in approximation theory. As for neural networks, at first, Carroll and Dickinson [?] and Ito [?] regarded a neural network as a Radon transform [?]. Irie and Miyake [?], Funahashi [?], Jones [?], and Barron [21] used Fourier analysis to show the approximation property in a constructive way. Kůrková [?] applied Barron's error bound to evaluate the complexity of neural networks. Refer to Kainen et al. [22] for more details.

In the late 1990s, Candès [23, 24], Rubin [25], and Murata [1] independently proposed the so-called ridgelet transform, which has since been investigated by a number of authors [27, 28, 29, 30, 31].
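1.3 Variations of Ridgelet Transform

A ridgelet transform $\mathscr{R}_\psi$, along with its reconstruction property, is determined by four classes of functions: domain $\mathcal{X}(\mathbb{R}^m)$, range $\mathcal{Y}(\mathbb{Y}^{m+1})$, ridgelet $\mathcal{Z}(\mathbb{R})$, and dual ridgelet $\mathcal{W}(\mathbb{R})$.

$$ f \in \mathcal{X}(\mathbb{R}^m) \;\; \underset{\mathscr{R}^\dagger_\eta,\ \eta \in \mathcal{W}(\mathbb{R})}{\overset{\mathscr{R}_\psi,\ \psi \in \mathcal{Z}(\mathbb{R})}{\rightleftarrows}} \;\; T \in \mathcal{Y}(\mathbb{Y}^{m+1}) \tag{7} $$

The following ladder relations by Schwartz [32] are fundamental for describing the variations of the ridgelet transform:

$$ \begin{array}{lccccccccccc}
\text{(functions)} & \mathcal{D} & \subset & \mathcal{S} & \subset & \mathcal{D}_{L^1} & \subset & \mathcal{D}_{L^p} & \subset & \mathcal{O}_M & \subset & \mathcal{E} \\
 & \cap & & \cap & & \cap & & \cap & & \cap & & \cap \\
\text{(distributions)} & \underbrace{\mathcal{E}' \quad \subset \quad \mathcal{O}'_C \quad \subset \quad \mathcal{D}'_{L^1}}_{\text{integrable}} & & & & & \subset & \mathcal{D}'_{L^p} & \subset & \underbrace{\mathcal{S}' \quad \subset \quad \mathcal{D}'}_{\text{not always bounded}} & &
\end{array} \tag{8} $$

where the meanings of the symbols are given in Table 2.

The integral transform $T$ by Murata [1] coincides with the case for $\mathcal{Z} \subset \mathcal{D}$ and $\mathcal{W} \subset \mathcal{E} \cap L^1$. Candès [23, 24] proposed the "ridgelet transform" for $\mathcal{Z} = \mathcal{W} \subset \mathcal{S}$. Kostadinova et al. [30, 31] defined the ridgelet transform for the Lizorkin distributions $\mathcal{X} = \mathcal{S}'_0$, which is the broadest domain ever known, at the cost of restricting the choice of ridgelet functions to the Lizorkin functions $\mathcal{W} = \mathcal{Z} = \mathcal{S}_0 \subset \mathcal{S}$.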


1.4 Our Goal 

Although many researchers have investigated the ridgelet transform [28, 29, 30, 31], in all of these settings $\mathcal{Z}$ does not directly admit some fundamental activation functions, namely the sigmoidal function and the ReLU. One of the challenges we faced was to define the ridgelet transform for $\mathcal{W} = \mathcal{S}'_0$, which admits the sigmoidal function and the ReLU.


2 Preliminaries 

2.1 Notations 

Throughout this paper, we consider approximating $f : \mathbb{R}^m \to \mathbb{C}$ by a neural network $g$ with hidden parameters $(a, b)$. Following Kostadinova et al. [30, 31], we denote by $\mathbb{Y}^{m+1} := \mathbb{R}^m \times \mathbb{R}$ the space of parameters $(a, b)$. As already denoted, we symbolize the domain of a ridgelet transform as $\mathcal{X}(\mathbb{R}^m)$, the range as $\mathcal{Y}(\mathbb{Y}^{m+1})$, the space of ridgelets as $\mathcal{Z}(\mathbb{R})$, and the space of dual ridgelets as $\mathcal{W}(\mathbb{R})$.

We denote by $\mathbb{S}^{m-1}$ the $(m-1)$-sphere $\{u \in \mathbb{R}^m \mid \|u\| = 1\}$; by $\mathbb{R}_+$ the open half-line $\{\alpha \in \mathbb{R} \mid \alpha > 0\}$; by $\mathbb{H}$ the open half-space $\mathbb{R}_+ \times \mathbb{R}$. We denote by $\mathbb{N}$ and $\mathbb{N}_0$ the sets of natural numbers excluding 0 and including 0, respectively.

We denote by $\widetilde{\cdot}$ the reflection $\widetilde{f}(x) := f(-x)$; by $\overline{\cdot}$ the complex conjugate; by $a \lesssim b$ that there exists a constant $C \ge 0$ such that $a \le Cb$.

2.2 Class of Functions and Distributions 

Following Schwartz, we denote the classes of functions and distributions as in Table 2. For Schwartz's distributions, we refer to Schwartz [32] and Trèves [33]; for Lebesgue spaces, Rudin [34], Brezis [35] and Yosida [36]; for Lizorkin distributions, Yuan et al. [37] and Holschneider [38].

The space $\mathcal{S}_0(\mathbb{R}^k)$ of Lizorkin functions is a closed subspace of $\mathcal{S}(\mathbb{R}^k)$ that consists of elements such that all moments vanish. That is, $\mathcal{S}_0(\mathbb{R}^k) := \{\phi \in \mathcal{S}(\mathbb{R}^k) \mid \int_{\mathbb{R}^k} x^\alpha \phi(x) \, dx = 0 \text{ for any } \alpha \in \mathbb{N}_0^k\}$. The dual space $\mathcal{S}'_0(\mathbb{R}^k)$, known as the Lizorkin distribution space, is homeomorphic to the quotient space of $\mathcal{S}'(\mathbb{R}^k)$ by the space of all polynomials $\mathcal{P}(\mathbb{R}^k)$. That is, $\mathcal{S}'_0(\mathbb{R}^k) \cong \mathcal{S}'(\mathbb{R}^k)/\mathcal{P}(\mathbb{R}^k)$. Refer to Yuan et al. [37, Prop. 8.1] for more details. In this work we identify and treat every polynomial as zero in the Lizorkin distribution. That is, for $p \in \mathcal{S}'(\mathbb{R}^k)$, if $p \in \mathcal{P}(\mathbb{R}^k)$ then $p \equiv 0$ in $\mathcal{S}'_0(\mathbb{R}^k)$.

Table 2: Classes of functions and distributions, and corresponding dual spaces.

| space | $\mathcal{A}(\mathbb{R}^k)$ | dual space | $\mathcal{A}'(\mathbb{R}^k)$ |
|---|---|---|---|
| polynomials of all degree | $\mathcal{P}(\mathbb{R}^k)$ | - | - |
| smooth functions | $\mathcal{E}(\mathbb{R}^k)$ | compactly supported distributions | $\mathcal{E}'(\mathbb{R}^k)$ |
| rapidly decreasing functions | $\mathcal{S}(\mathbb{R}^k)$ | tempered distributions | $\mathcal{S}'(\mathbb{R}^k)$ |
| compactly supported smooth functions | $\mathcal{D}(\mathbb{R}^k)$ | Schwartz distributions | $\mathcal{D}'(\mathbb{R}^k)$ |
| $L^p$ of Sobolev order $\infty$ ($1 \le p < \infty$) | $\mathcal{D}_{L^p}(\mathbb{R}^k)$ | Schwartz dists. ($1/p + 1/q = 1$) | $\mathcal{D}'_{L^q}(\mathbb{R}^k)$ |
| completion of $\mathcal{D}(\mathbb{R}^k)$ in $\mathcal{D}_{L^\infty}(\mathbb{R}^k)$ | $\dot{\mathcal{B}}(\mathbb{R}^k)$ | Schwartz dists. ($p = 1$) | $\mathcal{D}'_{L^1}(\mathbb{R}^k)$ |
| slowly increasing functions | $\mathcal{O}_M(\mathbb{R}^k)$ | - | - |
| - | - | rapidly decreasing distributions | $\mathcal{O}'_C(\mathbb{R}^k)$ |
| Lizorkin functions | $\mathcal{S}_0(\mathbb{R}^k)$ | Lizorkin distributions | $\mathcal{S}'_0(\mathbb{R}^k)$ |

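For $\mathbb{S}^{m-1}$, we work on the two subspaces $\mathcal{D}(\mathbb{S}^{m-1}) \subset \mathcal{D}(\mathbb{R}^m)$ and $\mathcal{E}'(\mathbb{S}^{m-1}) \subset \mathcal{E}'(\mathbb{R}^m)$. In addition, we identify $\mathcal{D} = \mathcal{S} = \mathcal{O}_M = \mathcal{E}$ and $\mathcal{E}' = \mathcal{O}'_C = \mathcal{S}' = \mathcal{D}'_{L^p} = \mathcal{D}'$.

For $\mathbb{H}$, let $\mathcal{E}(\mathbb{H}) \subset \mathcal{E}(\mathbb{R}^2)$ and $\mathcal{D}(\mathbb{H}) \subset \mathcal{D}(\mathbb{R}^2)$. For $T \in \mathcal{E}(\mathbb{H})$, write

$$ D^{k,\ell}_{s,t} T(\alpha, \beta) := (\alpha + 1/\alpha)^s (1 + \beta^2)^{t/2} \, \partial_\alpha^k \partial_\beta^\ell T(\alpha, \beta), \qquad s, t, k, \ell \in \mathbb{N}_0. \tag{9} $$

The space $\mathcal{S}(\mathbb{H})$ consists of $T \in \mathcal{E}(\mathbb{H})$ such that for any $s, t, k, \ell \in \mathbb{N}_0$, the seminorm below is finite

$$ \sup_{(\alpha,\beta) \in \mathbb{H}} \left| D^{k,\ell}_{s,t} T(\alpha, \beta) \right| < \infty. \tag{10} $$

The space $\mathcal{O}_M(\mathbb{H})$ consists of $T \in \mathcal{E}(\mathbb{H})$ such that for any $k, \ell \in \mathbb{N}_0$ there exist $s, t \in \mathbb{N}_0$ such that

$$ \left| D^{k,\ell}_{0,0} T(\alpha, \beta) \right| \lesssim (\alpha + 1/\alpha)^s (1 + \beta^2)^{t/2}. \tag{11} $$

The space $\mathcal{D}'(\mathbb{H})$ consists of all bounded linear functionals $\Phi$ on $\mathcal{D}(\mathbb{H})$ such that for every compact set $K \subset \mathbb{H}$, there exists $N \in \mathbb{N}_0$ such that

$$ \left| \int_K T(\alpha, \beta) \, \Phi(\alpha, \beta) \, \frac{d\alpha \, d\beta}{\alpha} \right| \lesssim \sum_{k, \ell \le N} \sup_{(\alpha,\beta) \in \mathbb{H}} \left| D^{k,\ell}_{0,0} T(\alpha, \beta) \right|, \qquad \forall T \in \mathcal{D}(K), \tag{12} $$

where the integral is understood as the action of $\Phi$. The space $\mathcal{S}'(\mathbb{H})$ consists of $\Phi \in \mathcal{D}'(\mathbb{H})$ for which there exists $N \in \mathbb{N}_0$ such that

$$ \left| \int_{\mathbb{H}} T(\alpha, \beta) \, \Phi(\alpha, \beta) \, \frac{d\alpha \, d\beta}{\alpha} \right| \lesssim \sum_{s, t, k, \ell \le N} \sup_{(\alpha,\beta) \in \mathbb{H}} \left| D^{k,\ell}_{s,t} T(\alpha, \beta) \right|, \qquad \forall T \in \mathcal{S}(\mathbb{H}). \tag{13} $$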


2.3 Convolution of Distributions 

Table 3 lists the convergent convolutions of distributions and their ranges by Schwartz [32].

In general a convolution of distributions may neither commute, $\phi * \psi \neq \psi * \phi$, nor associate, $\phi * (\psi * \eta) \neq (\phi * \psi) * \eta$. According to Schwartz [32, Ch.6 Th.7, Ch.7 Th.7], both $\mathcal{D}' * \mathcal{E}' * \mathcal{E}' * \cdots$ and $\mathcal{S}' * \mathcal{O}'_C * \mathcal{O}'_C * \cdots$ are commutative and associative.
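A standard textbook illustration of the failure of associativity, added here as an aside and not taken from the original text, uses the constant function $1$, the derivative $\delta'$ of Dirac's delta, and the unit step function $z_+^0$. None of the three pairs below falls under the two associative families just mentioned, because two of the three factors are not compactly supported.

```latex
% Convolution with \delta' is differentiation: \delta' * g = g'.
\begin{align*}
  1 * \bigl(\delta' * z_+^0\bigr) &= 1 * \bigl(z_+^0\bigr)' = 1 * \delta = 1, \\
  \bigl(1 * \delta'\bigr) * z_+^0 &= (1)' * z_+^0 = 0 * z_+^0 = 0 .
\end{align*}
```

The two bracketings give different results, so the order of evaluation matters once the support or decay conditions are dropped.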


2.4 Fourier Analysis 


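The Fourier transform $\widehat{\cdot}$ of $f : \mathbb{R}^m \to \mathbb{C}$ and the inverse Fourier transform $\check{\cdot}$ of $F : \mathbb{R}^m \to \mathbb{C}$ are given by

$$ \widehat{f}(\xi) := \int_{\mathbb{R}^m} f(x) \, e^{-i x \cdot \xi} \, dx, \qquad \xi \in \mathbb{R}^m \tag{14} $$

$$ \check{F}(x) := \frac{1}{(2\pi)^m} \int_{\mathbb{R}^m} F(\xi) \, e^{i x \cdot \xi} \, d\xi, \qquad x \in \mathbb{R}^m \tag{15} $$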
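Table 3: Range of convolution (excerpt from Schwartz [32])

| case | $\mathcal{A}_1$ | $\mathcal{A}_2$ | $\mathcal{A}_1 * \mathcal{A}_2$ |
|---|---|---|---|
| regularization | $\mathcal{D}$ | $\mathcal{D}', \mathcal{D}'_{L^p}, \mathcal{E}'$ | $\mathcal{E}, L^p, \mathcal{D}$ |
| compactly supported distribution | $\mathcal{E}'$ | $\mathcal{E}', \mathcal{E}, \mathcal{D}'$ | $\mathcal{E}', \mathcal{E}, \mathcal{D}'$ |
| regularization | $\mathcal{S}$ | $\mathcal{S}, \mathcal{S}'$ | $\mathcal{S}, \mathcal{O}_M$ |
| Schwartz convolutor | $\mathcal{O}'_C$ | $\mathcal{S}, \mathcal{O}'_C, \mathcal{D}'_{L^p}, \mathcal{S}'$ | $\mathcal{S}, \mathcal{O}'_C, \mathcal{D}'_{L^p}, \mathcal{S}'$ |
| Young's inequality | $L^p$ | $L^q$ | $L^r$ ($1/r = 1/p + 1/q - 1$) |
| Young's inequality | $\mathcal{D}'_{L^p}$ | $\mathcal{D}'_{L^q}$ | $\mathcal{D}'_{L^r}$ ($1/r = 1/p + 1/q - 1$) |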


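The Hilbert transform $\mathcal{H}$ of $f : \mathbb{R} \to \mathbb{C}$ is given by

$$ \mathcal{H} f(s) := \frac{i}{\pi} \, \mathrm{p.v.} \int_{-\infty}^{\infty} \frac{f(t)}{s - t} \, dt, \qquad s \in \mathbb{R} \tag{16} $$

where $\mathrm{p.v.} \int_{-\infty}^{\infty}$ denotes the principal value. We set the coefficients above to satisfy

$$ \widehat{\mathcal{H} f}(\omega) = \operatorname{sgn} \omega \cdot \widehat{f}(\omega), \tag{17} $$

$$ \mathcal{H}^2 f(s) = f(s). \tag{18} $$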

2.5 Radon Transform 


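The Radon transform $\mathrm{R}$ of $f : \mathbb{R}^m \to \mathbb{C}$ and the dual Radon transform $\mathrm{R}^*$ of $\Phi : \mathbb{S}^{m-1} \times \mathbb{R} \to \mathbb{C}$ are given by

$$ \mathrm{R} f(u, p) := \int_{(\mathbb{R}u)^\perp} f(pu + y) \, dy, \qquad (u, p) \in \mathbb{S}^{m-1} \times \mathbb{R} \tag{19} $$

$$ \mathrm{R}^* \Phi(x) := \int_{\mathbb{S}^{m-1}} \Phi(u, u \cdot x) \, du, \qquad x \in \mathbb{R}^m \tag{20} $$

where $(\mathbb{R}u)^\perp := \{y \in \mathbb{R}^m \mid y \cdot u = 0\}$ denotes the orthogonal complement of a line $\mathbb{R}u \subset \mathbb{R}^m$; $dy$ denotes the Lebesgue measure on $(\mathbb{R}u)^\perp$; and $du$ denotes the surface measure on $\mathbb{S}^{m-1}$.

We use the following fundamental results ([17, 39]) for $f \in L^1(\mathbb{R}^m)$ without proof: Radon's inversion formula

$$ \mathrm{R}^* \Lambda^{m-1} \mathrm{R} f = 2 (2\pi)^{m-1} f, \tag{21} $$

where the backprojection filter $\Lambda^m$ is defined in (24); the Fourier slice theorem

$$ \widehat{f}(\omega u) = \int_{\mathbb{R}} \mathrm{R} f(u, p) \, e^{-i p \omega} \, dp, \qquad (u, \omega) \in \mathbb{S}^{m-1} \times \mathbb{R} \tag{22} $$

where the left-hand side is the $m$-dimensional Fourier transform, whereas the right-hand side is the one-dimensional Fourier transform of the Radon transform; and a corollary of Fubini's theorem

$$ \int_{\mathbb{R}} \mathrm{R} f(u, p) \, dp = \int_{\mathbb{R}^m} f(x) \, dx, \qquad \text{a.e. } u \in \mathbb{S}^{m-1}. \tag{23} $$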
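To make (19) and (23) concrete, the following sketch computes the Radon transform of an isotropic Gaussian on $\mathbb{R}^2$ by a Riemann sum along each line. The test function, the angle, the grids, and the helper name `radon` are assumptions chosen for illustration; for this particular $f$ the transform is known in closed form, $\mathrm{R}f(u,p) = \sqrt{2\pi}\, e^{-p^2/2}$, independently of $u$.

```python
import numpy as np

def f(x, y):
    """Isotropic Gaussian on R^2 (hypothetical test function)."""
    return np.exp(-(x**2 + y**2) / 2)

def radon(func, theta, p, t):
    """Riemann-sum approximation of Rf(u, p) for u = (cos theta, sin theta).

    The line {p*u + t*u_perp} is sampled on the uniform grid t.
    """
    u = np.array([np.cos(theta), np.sin(theta)])
    u_perp = np.array([-u[1], u[0]])
    x = p * u[0] + t * u_perp[0]
    y = p * u[1] + t * u_perp[1]
    return np.sum(func(x, y)) * (t[1] - t[0])

t = np.linspace(-10.0, 10.0, 2001)
p_grid = np.linspace(-10.0, 10.0, 2001)
theta = 0.7
Rf = np.array([radon(f, theta, p, t) for p in p_grid])

# Corollary (23): integrating Rf over p recovers the integral of f, here 2*pi.
mass = np.sum(Rf) * (p_grid[1] - p_grid[0])
```

The computed profile matches the closed form, and its integral over $p$ matches $\int_{\mathbb{R}^2} f(x)\,dx = 2\pi$.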


2.6 Backprojection filter 

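For a function $\Phi(u, p)$, we define the backprojection filter $\Lambda^m$ as

$$ \Lambda^m \Phi(u, p) := \begin{cases} \partial_p^m \Phi(u, p), & m \text{ even} \\ \mathcal{H}_p \partial_p^m \Phi(u, p), & m \text{ odd}, \end{cases} \tag{24} $$

where $\mathcal{H}_p$ and $\partial_p$ denote the Hilbert transform and the partial differentiation with respect to $p$, respectively. It is designed as a one-dimensional Fourier multiplier with respect to $p \to \omega$ such that

$$ \widehat{\Lambda^m \Phi}(u, \omega) = i^m |\omega|^m \, \widehat{\Phi}(u, \omega). \tag{25} $$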
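The multiplier form (25) translates directly into an FFT-based sketch. The code below is a periodic-grid approximation for a function of $p$ only, under the assumption that the signal decays well inside the grid; the helper name `backprojection_filter` and the Gaussian test profile are introduced here for illustration.

```python
import numpy as np

def backprojection_filter(phi, dp, m):
    """Apply the multiplier i^m |omega|^m of (25) along a uniform p-grid."""
    omega = 2.0 * np.pi * np.fft.fftfreq(phi.size, d=dp)  # angular frequencies
    multiplier = (1j) ** m * np.abs(omega) ** m
    return np.fft.ifft(multiplier * np.fft.fft(phi))

n = 2048
p = np.linspace(-20.0, 20.0, n, endpoint=False)
dp = p[1] - p[0]
phi = np.exp(-p**2 / 2)

# For even m the filter is plain differentiation; m = 2 gives the second derivative.
filtered = backprojection_filter(phi, dp, m=2)
exact = (p**2 - 1.0) * np.exp(-p**2 / 2)
```

For $m = 2$ the output agrees with the analytic second derivative of the Gaussian, consistent with the even case of (24).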
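3 Classical Ridgelet Transform

3.1 An Overview

The ridgelet transform $\mathscr{R}_\psi f$ of $f : \mathbb{R}^m \to \mathbb{C}$ with respect to $\psi : \mathbb{R} \to \mathbb{C}$ is formally given by

$$ \mathscr{R}_\psi f(a, b) := \int_{\mathbb{R}^m} f(x) \, \overline{\psi(a \cdot x - b)} \, \|a\|^s \, dx, \qquad (a, b) \in \mathbb{Y}^{m+1} \text{ and } s > 0. \tag{26} $$

The factor $\|a\|^s$ is simply posed for technical convenience. After the next section we set $s = 1$, which simplifies some notations (e.g., Theorem 4.2). Murata [1] originally posed $s = 0$, which is suitable for the Euclidean formulation. Other authors such as Candès [23] used $s = 1/2$, Rubin [25] used $s = m$, and Kostadinova et al. [30] used $s = 1$.

When $f \in L^1(\mathbb{R}^m)$ and $\psi \in L^\infty(\mathbb{R})$, by using Hölder's inequality, the ridgelet transform is absolutely convergent at every $(a, b) \in \mathbb{Y}^{m+1}$:

$$ \int_{\mathbb{R}^m} \left| f(x) \, \overline{\psi(a \cdot x - b)} \, \|a\|^s \right| dx \le \|f\|_{L^1(\mathbb{R}^m)} \cdot \|\psi\|_{L^\infty(\mathbb{R})} \cdot \|a\|^s < \infty. \tag{27} $$

In particular when $s = 0$, the estimate is independent of $a$ and thus $\mathscr{R}_\psi f \in L^\infty(\mathbb{Y}^{m+1})$. Furthermore, $\mathscr{R}$ is a bounded bilinear operator $L^1(\mathbb{R}^m) \times L^\infty(\mathbb{R}) \to L^\infty(\mathbb{Y}^{m+1})$.

The dual ridgelet transform $\mathscr{R}^\dagger_\eta T$ of $T : \mathbb{Y}^{m+1} \to \mathbb{C}$ with respect to $\eta : \mathbb{R} \to \mathbb{C}$ is formally given by

$$ \mathscr{R}^\dagger_\eta T(x) := \int_{\mathbb{Y}^{m+1}} T(a, b) \, \eta(a \cdot x - b) \, \|a\|^{-s} \, da \, db, \qquad x \in \mathbb{R}^m. \tag{28} $$

The integral is absolutely convergent when $\eta \in L^\infty(\mathbb{R})$ and $T \in L^1(\mathbb{Y}^{m+1}; \|a\|^{-s} da \, db)$ at every $x \in \mathbb{R}^m$,

$$ \int_{\mathbb{Y}^{m+1}} \left| T(a, b) \, \eta(a \cdot x - b) \right| \|a\|^{-s} \, da \, db \le \|T\|_{L^1(\mathbb{Y}^{m+1}; \|a\|^{-s} da \, db)} \cdot \|\eta\|_{L^\infty(\mathbb{R})} < \infty, \tag{29} $$

and thus $\mathscr{R}^\dagger$ is a bounded bilinear operator $L^1(\mathbb{Y}^{m+1}; \|a\|^{-s} da \, db) \times L^\infty(\mathbb{R}) \to L^\infty(\mathbb{R}^m)$.

Two functions $\psi$ and $\eta$ are said to be admissible when

$$ K_{\psi,\eta} := (2\pi)^{m-1} \int_{-\infty}^{\infty} \frac{\overline{\widehat{\psi}(\zeta)} \, \widehat{\eta}(\zeta)}{|\zeta|^m} \, d\zeta, \tag{30} $$

is finite and not zero. Provided that $\psi$, $\eta$, and $f$ belong to some good classes, and $\psi$ and $\eta$ are admissible, then the reconstruction formula

$$ \mathscr{R}^\dagger_\eta \mathscr{R}_\psi f = K_{\psi,\eta} f, \tag{31} $$

holds.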

3.2 Ridgelet Transform in Other Expressions 

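It is convenient to write the ridgelet transform in "polar" coordinates as

$$ \mathscr{R}_\psi f(u, \alpha, \beta) = \int_{\mathbb{R}^m} f(x) \, \overline{\psi\!\left( \frac{u \cdot x - \beta}{\alpha} \right)} \frac{1}{\alpha^s} \, dx, \tag{32} $$

where "polar" variables are given by

$$ u := a / \|a\|, \qquad \alpha := 1 / \|a\|, \qquad \beta := b / \|a\|. \tag{33} $$

Emphasizing the connection with wavelet analysis, we define the "radius" $\alpha$ as reciprocal. Provided there is no likelihood of confusion, we use the same symbol $\mathbb{Y}^{m+1}$ for the parameter space, regardless of whether it is parametrized by $(a, b) \in \mathbb{R}^m \times \mathbb{R}$ or $(u, \alpha, \beta) \in \mathbb{S}^{m-1} \times \mathbb{R}_+ \times \mathbb{R}$.

For a fixed $(u, \alpha, \beta) \in \mathbb{Y}^{m+1}$, the ridgelet function

$$ \psi_{u,\alpha,\beta}(x) := \psi\!\left( \frac{u \cdot x - \beta}{\alpha} \right) \frac{1}{\alpha^s}, \qquad x \in \mathbb{R}^m \tag{34} $$

behaves as a constant function on $(\mathbb{R}u)^\perp$, and as a dilated and translated wavelet function on $\mathbb{R}u$. That is, by using the orthogonal decomposition $x = pu + y$ with $p \in \mathbb{R}$ and $y \in (\mathbb{R}u)^\perp$,

$$ \psi_{u,\alpha,\beta}(x) = \psi\!\left( \frac{u \cdot (pu + y) - \beta}{\alpha} \right) \frac{1}{\alpha^s} = \psi\!\left( \frac{p - \beta}{\alpha} \right) \frac{1}{\alpha^s} \otimes 1(y). \tag{35} $$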

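By using the decomposition above and Fubini's theorem, and assuming that the ridgelet transform is absolutely convergent, we have the following equivalent expressions

$$ \mathscr{R}_\psi f(u, \alpha, \beta) = \int_{\mathbb{R}} \left( \int_{(\mathbb{R}u)^\perp} f(pu + y) \, dy \right) \overline{\psi\!\left( \frac{p - \beta}{\alpha} \right)} \frac{1}{\alpha^s} \, dp \tag{36} $$

$$ = \int_{\mathbb{R}} \mathrm{R} f(u, p) \, \overline{\psi\!\left( \frac{p - \beta}{\alpha} \right)} \frac{1}{\alpha^s} \, dp \tag{37} $$

$$ = \int_{\mathbb{R}} \alpha^{1-s} \, \mathrm{R} f(u, \alpha z + \beta) \, \overline{\psi(z)} \, dz \qquad \text{(weak form)} \tag{38} $$

$$ = \left( \mathrm{R} f(u, \cdot) * \overline{\widetilde{\psi_\alpha}} \right)(\beta), \qquad \psi_\alpha(p) := \psi\!\left( \frac{p}{\alpha} \right) \frac{1}{\alpha^s} \qquad \text{(convolution form)} \tag{39} $$

$$ = \frac{1}{2\pi} \int_{\mathbb{R}} \widehat{f}(\omega u) \, \overline{\widehat{\psi}(\alpha \omega)} \, \alpha^{1-s} e^{i \omega \beta} \, d\omega, \qquad \text{(Fourier slice th. [30])} \tag{40} $$

where $\mathrm{R}$ denotes the Radon transform (19); the Fourier form follows by applying the identity $\mathcal{F}^{-1} \mathcal{F} = \mathrm{Id}$ to the convolution form. These reformulations reflect a well-known claim [28, ?] that ridgelet analysis is wavelet analysis in the Radon domain.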


3.3 Dual Ridgelet Transform in Other Expressions 

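Provided the dual ridgelet transform (28) is absolutely convergent, some changes of variables lead to other expressions.

$$ \mathscr{R}^\dagger_\eta T(x) = \int_{\mathbb{R}^m} \int_{\mathbb{R}} T(a, b) \, \eta(a \cdot x - b) \, \|a\|^{-s} \, db \, da \tag{41} $$

$$ = \int_0^\infty \int_{\mathbb{S}^{m-1}} \int_{\mathbb{R}} T(ru, b) \, \eta(ru \cdot x - b) \, db \, du \, r^{m-s-1} \, dr \tag{42} $$

$$ = \int_{\mathbb{S}^{m-1}} \int_0^\infty \int_{\mathbb{R}} T\!\left( \frac{u}{\alpha}, \frac{\beta}{\alpha} \right) \eta\!\left( \frac{u \cdot x - \beta}{\alpha} \right) \frac{d\beta \, d\alpha \, du}{\alpha^{m-s+2}} \qquad \text{(polar expression)} \tag{43} $$

$$ = \int_{\mathbb{S}^{m-1}} \int_0^\infty \int_{\mathbb{R}} T(u, \alpha, u \cdot x - \alpha z) \, \eta(z) \, \frac{dz \, d\alpha \, du}{\alpha^{m-s+1}}, \qquad \text{(weak form)} \tag{44} $$

where every integral is understood to be an iterated integral; the second equation follows by substituting $(r, u) \leftarrow (\|a\|, a/\|a\|)$ and using the coarea formula for polar coordinates; the third equation follows by substituting $(\alpha, \beta) \leftarrow (1/r, b/r)$ and using Fubini's theorem; in the fourth equation, with a slight abuse of notation, we write $T(u, \alpha, \beta) := T(u/\alpha, \beta/\alpha)$.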


Furthermore, write $\eta_\alpha(p) := \eta(p/\alpha)/\alpha^t$. Recall that the dual Radon transform $\mathrm{R}^*$ is given by (20) and the Mellin transform $\mathcal{M}$ [35] is given by $\mathcal{M} f(z) := \int_0^\infty f(\alpha) \, \alpha^{z-1} \, d\alpha$, $z \in \mathbb{C}$. Then

$$ \mathscr{R}^\dagger_\eta T(x) = \mathrm{R}^* \Big[ \mathcal{M} \big[ T(u, \alpha, \cdot) * \eta_\alpha \big] (s + t - m - 1) \Big] (x). \tag{45} $$

Note that the composition of the Mellin transform and the convolution is the dual wavelet transform [38]. Thus, the dual ridgelet transform is the composition of the dual Radon transform and the dual wavelet transform.
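The Mellin argument $s + t - m - 1$ in (45) can be recovered from the polar expression (43) by bookkeeping the powers of $\alpha$. The short derivation below is a reading aid added here and uses only the definitions of $\eta_\alpha$ and $\mathcal{M}$ given above.

```latex
\begin{align*}
  \int_{\mathbb{R}} T(u,\alpha,\beta)\,
     \eta\!\left(\frac{u\cdot x-\beta}{\alpha}\right) d\beta
  &= \alpha^{t}\,\bigl(T(u,\alpha,\cdot) * \eta_\alpha\bigr)(u\cdot x), \\
  \int_0^\infty \alpha^{t}\,\bigl(T(u,\alpha,\cdot) * \eta_\alpha\bigr)(u\cdot x)\,
     \frac{d\alpha}{\alpha^{m-s+2}}
  &= \int_0^\infty \bigl(T(u,\alpha,\cdot) * \eta_\alpha\bigr)(u\cdot x)\,
     \alpha^{(s+t-m-1)-1}\, d\alpha .
\end{align*}
```

The second line is the Mellin transform in $\alpha$ evaluated at $z = s + t - m - 1$, and the remaining integral over $u \in \mathbb{S}^{m-1}$ of a function of $(u, u \cdot x)$ is the dual Radon transform (20).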


4 Ridgelet Transform with respect to Distributions 


Using the weak expressions (38) and (44), we define the ridgelet transform with respect to distributions. 


Henceforth, we focus on the case for which the index s in (26) equals 1. 
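4.1 Definition and Well-Definedness

Definition 4.1 (Ridgelet Transform with respect to Distributions). The ridgelet transform $\mathscr{R}_\psi f$ of a function $f \in \mathcal{X}(\mathbb{R}^m)$ with respect to a distribution $\psi \in \mathcal{Z}(\mathbb{R})$ is given by

$$ \mathscr{R}_\psi f(u, \alpha, \beta) := \int_{\mathbb{R}} \mathrm{R} f(u, \alpha z + \beta) \, \overline{\psi(z)} \, dz, \qquad (u, \alpha, \beta) \in \mathbb{Y}^{m+1} \tag{46} $$

where $\int_{\mathbb{R}} \cdot \, \overline{\psi(z)} \, dz$ is understood as the action of a distribution $\psi$.

Obviously, this "weak" definition coincides with the ordinary strong one when $\psi$ coincides with a locally integrable function ($L^1_{\mathrm{loc}}$). With a slight abuse of notation, the weak definition coincides with the convolution form

$$ \mathscr{R}_\psi f(u, \alpha, \beta) = \left( \mathrm{R} f(u, \cdot) * \overline{\widetilde{\psi_\alpha}} \right)(\beta), \qquad (u, \alpha, \beta) \in \mathbb{Y}^{m+1} \tag{47} $$

where $\psi_\alpha(p) := \psi(p/\alpha)/\alpha$; the convolution $\cdot * \cdot$, dilation $\cdot_\alpha$, reflection $\widetilde{\cdot}$, and complex conjugation $\overline{\cdot}$ are understood as operations for Schwartz distributions.

Theorem 4.2 (Balancing Theorem). The ridgelet transform $\mathscr{R} : \mathcal{X}(\mathbb{R}^m) \times \mathcal{Z}(\mathbb{R}) \to \mathcal{Y}(\mathbb{Y}^{m+1})$ is well defined as a bilinear map when $\mathcal{X}$ and $\mathcal{Z}$ are chosen from Table 4.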

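Table 4: Combinations of classes for which the ridgelet transform is well defined as a bilinear map. The first and third columns list domains $\mathcal{X}(\mathbb{R}^m)$ of $f$ and $\mathcal{Z}(\mathbb{R})$ of $\psi$, respectively. The second column lists the range of the Radon transform $\mathrm{R}f(u,p)$, for which we reuse the same symbol $\mathcal{X}$ because it coincides. The fourth, fifth, and sixth columns list the range of the ridgelet transform $\mathscr{R}_\psi f(u, \alpha, \beta)$ with respect to $\beta$, $(\alpha, \beta)$, and $(u, \alpha, \beta)$, respectively.

| $f(x)$: $\mathcal{X}(\mathbb{R}^m)$ | $\mathrm{R}f(u,p)$: $\mathcal{X}(\mathbb{S}^{m-1} \times \mathbb{R})$ | $\psi(z)$: $\mathcal{Z}(\mathbb{R})$ | in $\beta$: $\mathcal{B}(\mathbb{R})$ | in $(\alpha,\beta)$: $\mathcal{A}(\mathbb{H})$ | in $(u,\alpha,\beta)$: $\mathcal{Y}(\mathbb{Y}^{m+1})$ |
|---|---|---|---|---|---|
| $\mathcal{D}$ | $\mathcal{D}$ | $\mathcal{D}'$ | $\mathcal{E}$ | $\mathcal{E}$ | $\mathcal{E}$ |
| $\mathcal{E}'$ | $\mathcal{E}'$ | $\mathcal{D}'$ | $\mathcal{D}'$ | $\mathcal{D}'$ | $\mathcal{D}'$ |
| $\mathcal{S}$ | $\mathcal{S}$ | $\mathcal{S}'$ | $\mathcal{O}_M$ | $\mathcal{O}_M$ | $\mathcal{O}_M$ |
| $\mathcal{O}'_C$ | $\mathcal{O}'_C$ | $\mathcal{S}'$ | $\mathcal{S}'$ | $\mathcal{S}'$ | $\mathcal{S}'$ |
| $L^1$ | $L^1$ | $L^p \cap C^0$ | $L^p \cap C^0$ | $\mathcal{S}'$ | $\mathcal{S}'$ |
| $\mathcal{D}'_{L^1}$ | $\mathcal{D}'_{L^1}$ | $\mathcal{D}'_{L^p}$ | $\mathcal{D}'_{L^p}$ | $\mathcal{S}'$ | $\mathcal{S}'$ |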


The proof is provided in Appendix A. Note that each $\mathcal{Z}$ is (almost) the largest in the sense that the convolution $\mathcal{B} = \mathcal{X} * \mathcal{Z}$ converges. Thus, Table 4 suggests that there is a trade-off relation between $\mathcal{X}$ and $\mathcal{Z}$, that is, as $\mathcal{X}$ increases, $\mathcal{Z}$ decreases and vice versa.

Extension of the ridgelet transform to non-integrable functions requires more sophisticated approaches, because a direct computation of the Radon transform may diverge. For instance, Kostadinova et al. [30] extend $\mathcal{X} = \mathcal{S}'_0$ by using a duality technique. In §5.3 we extend the ridgelet transform to $L^2(\mathbb{R}^m)$ by using the bounded extension procedure.

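Proposition 4.3 (Continuity of the Ridgelet Transform $L^1(\mathbb{R}^m) \to L^\infty(\mathbb{Y}^{m+1})$). Fix $\psi \in \mathcal{S}(\mathbb{R})$. The ridgelet transform $\mathscr{R}_\psi : L^1(\mathbb{R}^m) \to L^\infty(\mathbb{Y}^{m+1})$ is bounded.

Proof. Fix an arbitrary $f \in L^1(\mathbb{R}^m)$ and $\psi \in \mathcal{S}(\mathbb{R})$. Recall that this case is absolutely convergent. By using the convolution form,

$$ \operatorname*{ess\,sup}_{(u,\alpha,\beta)} \left| \left( \mathrm{R} f(u, \cdot) * \overline{\widetilde{\psi_\alpha}} \right)(\beta) \right| \le \|f\|_{L^1(\mathbb{R}^m)} \cdot \operatorname*{ess\,sup}_{(\alpha,\beta)} |\psi_\alpha(\beta)| \tag{48} $$

$$ \le \|f\|_{L^1(\mathbb{R}^m)} \cdot \operatorname*{ess\,sup}_{(r,\beta)} |r \cdot \psi(r\beta)| < \infty, \tag{49} $$

where the first inequality follows by using Young's inequality and applying $\int_{\mathbb{R}} |\mathrm{R} f(u, p)| \, dp \le \|f\|_1$; the second inequality follows by changing the variable $r \leftarrow 1/\alpha$, and the resultant is finite because $\psi$ decays rapidly. □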
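The ridgelet transform $\mathscr{R}_\psi$ is injective when $\psi$ is admissible, because if $\psi$ is admissible then the reconstruction formula holds and thus $\mathscr{R}_\psi$ has the inverse. However, $\mathscr{R}_\psi$ is not always injective. For instance, take a Laplacian $f := \Delta g$ of some function $g \in \mathcal{S}(\mathbb{R}^m)$ and a polynomial $\psi(z) = z + 1$, which satisfies $\psi^{(2)} \equiv 0$. According to Table 4, $\mathscr{R}_\psi f$ exists as a smooth function because $f \in \mathcal{S}(\mathbb{R}^m)$ and $\psi \in \mathcal{S}'(\mathbb{R})$. In this case $\mathscr{R}_\psi f \equiv 0$, which means $\mathscr{R}_\psi$ is not injective. That is,

$$ \mathscr{R}_\psi f(u, \alpha, \beta) = \left( \mathrm{R} \Delta g(u, \cdot) * \overline{\widetilde{\psi_\alpha}} \right)(\beta) \tag{50} $$

$$ = \left( \partial^2 \mathrm{R} g(u, \cdot) * \overline{\widetilde{\psi_\alpha}} \right)(\beta) \tag{51} $$

$$ = \left( \mathrm{R} g(u, \cdot) * \partial^2 \overline{\widetilde{\psi_\alpha}} \right)(\beta) \tag{52} $$

$$ = \left( \mathrm{R} g(u, \cdot) * 0 \right)(\beta) \tag{53} $$

$$ = 0, \tag{54} $$

where the second equality follows by the intertwining relation $\mathrm{R} \Delta g(u, p) = \partial_p^2 \mathrm{R} g(u, p)$ [17]. Clearly the non-injectivity stems from the choice of $\psi$. In fact, as we see in the next section, no polynomial can be admissible and thus $\mathscr{R}_\psi$ is not injective for any polynomial $\psi$.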
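The non-injectivity example can be checked numerically in the simplest setting. The sketch below assumes $m = 1$ and $s = 1$, where the transform reduces to $\int f(x)\,\overline{\psi(ax - b)}\,|a|\,dx$, and takes $g(x) = e^{-x^2/2}$ so that $f = g''$; these choices, the grid, and the helper name `ridgelet_1d` are assumptions made for illustration. The ReLU is included only as a contrast for which the transform does not vanish.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 240001)
h = x[1] - x[0]
# f = g'' for the hypothetical choice g(x) = exp(-x^2 / 2).
f = (x**2 - 1.0) * np.exp(-x**2 / 2)

def ridgelet_1d(f_vals, psi, a, b):
    """Riemann sum for int f(x) * conj(psi(a*x - b)) * |a| dx (m = 1, s = 1)."""
    return np.sum(f_vals * np.conj(psi(a * x - b))) * abs(a) * h

def poly(z):
    """The polynomial psi(z) = z + 1, whose second derivative vanishes."""
    return z + 1.0

def relu(z):
    return np.maximum(z, 0.0)

a, b = 2.0, 1.0
value_poly = ridgelet_1d(f, poly, a, b)
value_relu = ridgelet_1d(f, relu, a, b)
# Integrating by parts twice gives a^2 * g(b / a) for the ReLU.
expected_relu = a**2 * np.exp(-(b / a) ** 2 / 2)
```

The polynomial ridgelet annihilates $f = g''$ up to rounding error, whereas the ReLU returns $a^2 g(b/a)$, which retains information about $g$.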


4.2 Dual Ridgelet Transform with respect to Distributions 

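Definition 4.4 (Dual Ridgelet Transform with respect to Distributions). The dual ridgelet transform $\mathscr{R}^\dagger_\eta T$ of $T \in \mathcal{Y}(\mathbb{Y}^{m+1})$ with respect to $\eta \in \mathcal{W}(\mathbb{R})$ is given by

$$ \mathscr{R}^\dagger_\eta T(x) = \lim_{\substack{\delta \to \infty \\ \varepsilon \to 0}} \int_{\mathbb{S}^{m-1}} \int_\varepsilon^\delta \int_{\mathbb{R}} T(u, \alpha, u \cdot x - \alpha z) \, \eta(z) \, \frac{dz \, d\alpha \, du}{\alpha^m}, \qquad x \in \mathbb{R}^m \tag{55} $$

where $\int_{\mathbb{R}} \cdot \, \eta(z) \, dz$ is understood as the action of a distribution $\eta$.

If the dual ridgelet transform $\mathscr{R}^\dagger_\eta$ exists, then it coincides with the dual operator [35] of the ridgelet transform $\mathscr{R}_\eta$.

Theorem 4.5. Let $\mathcal{X}$ and $\mathcal{Z}$ be chosen from Table 4. Fix $\psi \in \mathcal{Z}$. Assume that $\mathscr{R}_\psi : \mathcal{X}(\mathbb{R}^m) \to \mathcal{Y}(\mathbb{Y}^{m+1})$ is injective and that $\mathscr{R}^\dagger_\psi : \mathcal{Y}'(\mathbb{Y}^{m+1}) \to \mathcal{X}'(\mathbb{R}^m)$ exists. Then $\mathscr{R}^\dagger_\psi$ is the dual operator $(\mathscr{R}_\psi)' : \mathcal{Y}'(\mathbb{Y}^{m+1}) \to \mathcal{X}'(\mathbb{R}^m)$ of $\mathscr{R}_\psi$.

Proof. By assumption $\mathscr{R}_\psi$ is densely defined on $\mathcal{X}(\mathbb{R}^m)$ and injective. Therefore, by a classical result on the existence of the dual operator [35, VII.1 Th.1, p.193], there uniquely exists a dual operator $(\mathscr{R}_\psi)' : \mathcal{Y}'(\mathbb{Y}^{m+1}) \to \mathcal{X}'(\mathbb{R}^m)$. On the other hand, for $f \in \mathcal{X}(\mathbb{R}^m)$ and $T \in \mathcal{Y}'(\mathbb{Y}^{m+1})$,

$$ \left\langle \mathscr{R}_\psi f, T \right\rangle_{\mathbb{Y}^{m+1}} = \int_{\mathbb{R}^m \times \mathbb{Y}^{m+1}} f(x) \, \overline{\psi(a \cdot x - b)} \, T(a, b) \, dx \, da \, db = \left\langle f, \mathscr{R}^\dagger_\psi T \right\rangle_{\mathbb{R}^m}. \tag{56} $$

By the uniqueness of the dual operator, we can conclude $(\mathscr{R}_\psi)' = \mathscr{R}^\dagger_\psi$. □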


5 Reconstruction Formula for Weak Ridgelet Transform 

In this section we discuss the admissibility condition and the reconstruction formula, not only in the Fourier domain as many authors did [23, 24, 25, 30, 31], but also in the real domain and in the Radon domain. Both domains are key to the constructive formulation. In §5.1 we derive a constructive admissibility condition. In §5.2 we show two reconstruction formulas. The first of these formulas is obtained by using the Fourier slice theorem and the other by using the Radon transform. In §5.3 we extend the ridgelet transform to $L^2$.


5.1 Admissibility Condition 

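Definition 5.1 (Admissibility Condition). A pair $(\psi, \eta) \in \mathcal{S}(\mathbb{R}) \times \mathcal{S}'(\mathbb{R})$ is said to be admissible when there exists a neighborhood $\Omega \subset \mathbb{R}$ of 0 such that $\widehat{\eta} \in L^1_{\mathrm{loc}}(\Omega \setminus \{0\})$, and the integral

$$ K_{\psi,\eta} := (2\pi)^{m-1} \left( \int_{\Omega \setminus \{0\}} + \int_{\mathbb{R} \setminus \Omega} \right) \frac{\overline{\widehat{\psi}(\zeta)} \, \widehat{\eta}(\zeta)}{|\zeta|^m} \, d\zeta, \tag{57} $$

converges and is not zero, where $\int_{\Omega \setminus \{0\}}$ and $\int_{\mathbb{R} \setminus \Omega}$ are understood as Lebesgue's integral and the action of $\widehat{\eta}$, respectively.

Using the Fourier transform in $\mathcal{W}$ requires us to assume that $\mathcal{W} \subset \mathcal{S}'$.

The second integral is always finite because $|\zeta|^{-m} \in \mathcal{O}_M(\mathbb{R} \setminus \Omega)$ and thus $|\zeta|^{-m} \overline{\widehat{\psi}(\zeta)}$ decays rapidly; therefore, by definition the action of a tempered distribution $\widehat{\eta}$ always converges. The convergence of the first integral $\int_{\Omega \setminus \{0\}}$ does not depend on the choice of $\Omega$ because for every two neighborhoods $\Omega$ and $\Omega'$ of 0, the residual $\int_{\Omega \setminus \Omega'}$ is always finite. Hence, the convergence of $K_{\psi,\eta}$ does not depend on the choice of $\Omega$.

The removal of 0 from the integral is essential because a product of two singular distributions, which is indeterminate in general, can occur at 0. See examples below. In Appendix C we have to treat $|\zeta|^{-m}$ as a locally integrable function, rather than simply a regularized distribution such as Hadamard's finite part. If the integrand coincides with a function at 0, then obviously $\int_{\mathbb{R} \setminus \{0\}} = \int_{\mathbb{R}}$.

If $\widehat{\eta}$ is supported in the singleton $\{0\}$ then $\eta$ cannot be admissible because $K_{\psi,\eta} = 0$ for any $\psi \in \mathcal{S}(\mathbb{R})$. According to Rudin [34, Ex. 7.16], it happens if and only if $\eta$ is a polynomial. Therefore, it is natural to take $\mathcal{W} = \mathcal{S}'/\mathcal{P} \cong \mathcal{S}'_0$ rather than $\mathcal{W} = \mathcal{S}'$. That is, in $\mathcal{S}'_0(\mathbb{R})$, we identify a polynomial $\eta \in \mathcal{P}(\mathbb{R})$ as $0 \in \mathcal{S}'(\mathbb{R})$. The integral $K_{\psi,\eta}$ is well-defined for $\mathcal{S}'_0(\mathbb{R})$. Namely $K_{\psi,\eta}$ is invariant under the addition of a polynomial $Q$ to $\eta$

$$ K_{\psi,\eta} = K_{\psi,\eta+Q}. \tag{58} $$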


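Example 5.2 (Modification of Schwartz [32, Ch.5 Th.6]). Let $\eta(z) = z$ and $\psi(z) = \Lambda G(z)$ with $G(z) = \exp(-z^2/2)$. Then,

$$ \widehat{\eta}(\zeta) = \delta(\zeta) \quad \text{and} \quad \widehat{\psi}(\zeta) = |\zeta| \cdot \widehat{G}(\zeta). \tag{59} $$

In this case the product of the two distributions is not associative

$$ \int_{\mathbb{R}} \mathrm{p.v.} \frac{1}{|\zeta|} \times \left( |\zeta| \cdot \widehat{G}(\zeta) \times \delta(\zeta) \right) d\zeta = 0, \tag{60} $$

$$ \int_{\mathbb{R}} \left( \mathrm{p.v.} \frac{1}{|\zeta|} \times |\zeta| \cdot \widehat{G}(\zeta) \right) \times \delta(\zeta) \, d\zeta = \widehat{G}(0) \neq 0. \tag{61} $$

On the other hand (57) is well defined

$$ K_{\psi,\eta} = \int_{0 < |\zeta| < 1} \frac{|\zeta| \cdot \widehat{G}(\zeta) \times 0}{|\zeta|} \, d\zeta + \int_{1 \le |\zeta|} \frac{|\zeta| \cdot \widehat{G}(\zeta)}{|\zeta|} \, \delta(\zeta) \, d\zeta = 0. \tag{62} $$

Example 5.3. Let $\eta(z) = z_+^0 + (2\pi)^{-1} \exp iz$ and $\psi(z) = \Lambda G(z)$. Then,

$$ \widehat{\eta}(\zeta) = \mathrm{p.v.} \frac{1}{\zeta} + \delta(\zeta) + \delta(\zeta - 1) \quad \text{and} \quad \widehat{\psi}(\zeta) = |\zeta| \cdot \widehat{G}(\zeta). \tag{63} $$

The product of the two distributions is not associative

$$ \int_{\mathbb{R}} \mathrm{p.v.} \frac{1}{|\zeta|} \times \left( |\zeta| \cdot \widehat{G}(\zeta) \times \left( \frac{1}{\zeta} + \delta(\zeta) + \delta(\zeta - 1) \right) \right) d\zeta = \widehat{G}(1), \tag{64} $$

$$ \int_{\mathbb{R}} \left( \mathrm{p.v.} \frac{1}{|\zeta|} \times |\zeta| \cdot \widehat{G}(\zeta) \right) \times \left( \frac{1}{\zeta} + \delta(\zeta) + \delta(\zeta - 1) \right) d\zeta = \widehat{G}(0) + \widehat{G}(1) \neq 0. \tag{65} $$

On the other hand, (57) is well defined

$$ K_{\psi,\eta} = \int_{0 < |\zeta| < 1} \frac{|\zeta| \cdot \widehat{G}(\zeta) \times \zeta^{-1}}{|\zeta|} \, d\zeta + \int_{1 \le |\zeta|} \frac{|\zeta| \cdot \widehat{G}(\zeta)}{|\zeta|} \left( \mathrm{p.v.} \frac{1}{\zeta} + \delta(\zeta) + \delta(\zeta - 1) \right) d\zeta = \infty + \widehat{G}(1). \tag{66} $$

Observe that formally the integrand $\widehat{u}(\zeta) := \overline{\widehat{\psi}(\zeta)} \, \widehat{\eta}(\zeta) \, |\zeta|^{-m}$ is a solution of $|\zeta|^m \widehat{u}(\zeta) = \overline{\widehat{\psi}(\zeta)} \, \widehat{\eta}(\zeta)$. By taking the Fourier inversion, we have $\Lambda^m u = \overline{\widetilde{\psi}} * \eta$. To be exact, $\widehat{\eta}$ may contain a point mass at the origin, such as Dirac's $\delta$.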
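Theorem 5.4 (Structure Theorem for Admissible Pairs). Let $(\psi, \eta) \in \mathcal{S}(\mathbb{R}) \times \mathcal{S}'(\mathbb{R})$. Assume that there exists $k \in \mathbb{N}_0$ such that

$$ \widehat{\eta}(\zeta) = \sum_{j=0}^{k} c_j \, \delta^{(j)}(\zeta), \qquad \zeta \in \{0\}. \tag{67} $$

Assume that there exists a neighborhood $\Omega$ of 0 such that $\widehat{\eta} \in C^0(\Omega \setminus \{0\})$. Then $\psi$ and $\eta$ are admissible if and only if there exists $u \in \mathcal{O}_M(\mathbb{R})$ such that

$$ \Lambda^m u = \overline{\widetilde{\psi}} * \left( \eta - \sum_{j=0}^{k} c_j z^j \right) \quad \text{and} \quad \int_{\mathbb{R} \setminus \{0\}} \widehat{u}(\zeta) \, d\zeta \neq 0, \tag{68} $$

where $\Lambda$ is the backprojection filter defined in (24). In addition, $\lim_{\zeta \to +0} |\widehat{u}(\zeta)| < \infty$ and $\lim_{\zeta \to -0} |\widehat{u}(\zeta)| < \infty$.

The proof is provided in Appendix B. Note that the continuity implies local integrability. If $\psi$ has $\ell$ vanishing moments with $\ell \ge k$, namely $\int_{\mathbb{R}} \psi(z) z^j \, dz = 0$ for $j \le \ell$, then the condition reduces to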


A m u = ip * ip 


u(z)dz 

Jr 


< oo 


and 



(69) 


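As a consequence of Theorem 5.4, we can construct admissible pairs as below.

Corollary 5.5 (Construction of Admissible Pairs). Given $\eta \in \mathcal{S}'_0(\mathbb{R})$. Assume that there exists a neighborhood $\Omega$ of 0 and $k \in \mathbb{N}_0$ such that $\zeta^k \cdot \widehat{\eta}(\zeta) \in C^0(\Omega)$. Take $\psi_0 \in \mathcal{S}(\mathbb{R})$ such that

$$ \int_{\mathbb{R}} \zeta^k \, \overline{\widehat{\psi_0}(\zeta)} \, \widehat{\eta}(\zeta) \, d\zeta \neq 0. \tag{70} $$

Then

$$ \psi := \Lambda^m \psi_0^{(k)}, \tag{71} $$

is admissible with $\eta$.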

The proof is obvious because $u := \overline{\widetilde{\psi_0^{(k)}}} * \eta$ satisfies the conditions in Theorem 5.4.
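As a worked instance of Corollary 5.5, consider the ReLU $\eta(z) = z_+$ together with the Gaussian $\psi_0 = G$, $G(z) = \exp(-z^2/2)$. The computation below is a sketch added for illustration: it assumes the Fourier convention (14), under which $\widehat{\eta}(\zeta) = -\zeta^{-2}$ for $\zeta \neq 0$ and $\widehat{G}(\zeta) = \sqrt{2\pi}\,e^{-\zeta^2/2}$, and the overall constants depend on that convention.

```latex
% Step 1: the hypothesis of Corollary 5.5 holds with k = 2, since
%         \zeta^2 \widehat{\eta}(\zeta) = -1 near the origin (and \zeta^2 \delta'(\zeta) = 0).
% Step 2: condition (70).
\begin{align*}
  \int_{\mathbb{R}} \zeta^{2}\,\overline{\widehat{G}(\zeta)}\,\widehat{\eta}(\zeta)\,d\zeta
    &= -\int_{\mathbb{R}} \widehat{G}(\zeta)\,d\zeta = -2\pi\,G(0) = -2\pi \neq 0 .
\intertext{Step 3: with $\psi := \Lambda^{m} G''$, the multiplier (25) gives
  $\widehat{\psi}(\zeta) = i^{m}|\zeta|^{m}(i\zeta)^{2}\,\widehat{G}(\zeta)
                        = -\,i^{m}|\zeta|^{m}\zeta^{2}\,\widehat{G}(\zeta)$, so that}
  K_{\psi,\eta}
    &= (2\pi)^{m-1}\int_{\mathbb{R}\setminus\{0\}}
       \frac{\overline{\widehat{\psi}(\zeta)}\,\widehat{\eta}(\zeta)}{|\zeta|^{m}}\,d\zeta
     = (2\pi)^{m-1}(-i)^{m}\int_{\mathbb{R}} \widehat{G}(\zeta)\,d\zeta
     = (-i)^{m}(2\pi)^{m} .
\end{align*}
```

The integral is finite and non-zero, so under these assumptions the pair $(\Lambda^m G'', z_+)$ is admissible; two derivatives of $\psi_0$ are exactly what cancels the $\zeta^{-2}$ singularity of the ReLU at the origin.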


5.2 Reconstruction Formula 

Theorem 5.6 (Reconstruction Formula). Let $f \in L^1(\mathbb{R}^m)$ satisfy $\widehat{f} \in L^1(\mathbb{R}^m)$ and let $(\psi, \eta) \in \mathcal{S}(\mathbb{R}) \times \mathcal{S}'_0(\mathbb{R})$ be admissible. Then the reconstruction formula

$$ \mathscr{R}^\dagger_\eta \mathscr{R}_\psi f(x) = K_{\psi,\eta} f(x), \tag{72} $$

holds for almost every $x \in \mathbb{R}^m$. The equality holds for every point where $f$ is continuous.

The proof is provided in Appendix C. The admissibility condition can be easily inverted to $(\psi, \eta) \in \mathcal{S}'_0 \times \mathcal{S}$. However, extensions to $\mathcal{S}'_0 \times \mathcal{S}'_0$ and $\mathcal{S} \times \mathcal{D}'$ may not be easy. This is because the multiplication $\mathcal{S}'_0 \cdot \mathcal{S}'_0$ is not always commutative, nor associative, and the Fourier transform is not always defined over $\mathcal{D}'$ [32].

The following theorem is another suggestive reconstruction formula that implies wavelet analysis in the Radon domain works as a backprojection filter. In other words, the admissibility condition requires $(\psi, \eta)$ to construct the filter $\Lambda^m$. Note that similar techniques are obtained for "wavelet measures" by Rubin [?].

Theorem 5.7 (Reconstruction Formula via Radon Transform). Let $f \in L^1(\mathbb{R}^m)$ be sufficiently smooth and $(\psi, \eta) \in \mathcal{S}(\mathbb{R}) \times \mathcal{S}'(\mathbb{R})$ be admissible. Assume that there exists a real-valued smooth and integrable function $u$ such that

$$ \Lambda^m u = \overline{\widetilde{\psi}} * \eta \quad \text{and} \quad \int_{\mathbb{R}} \widehat{u}(\zeta) \, d\zeta = -1. \tag{73} $$

Then,

$$ \mathscr{R}^\dagger_\eta \mathscr{R}_\psi f(x) = \mathrm{R}^* \Lambda^{m-1} \mathrm{R} f(x) = 2 (2\pi)^{m-1} f(x), \tag{74} $$

holds for almost every $x \in \mathbb{R}^m$.
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The proof is provided in[p} Note that here we imposed a stronger condition on u than the u e L 1 (M\{0}) 


we imposed in Theorem 5.4 


Recall intertwining relations ( jT7| Lem.2.1, Th.3.1, Th.3.7]) 


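$$(-\Delta)^{\frac{m-1}{2}} \mathrm{R}^* = \mathrm{R}^* \Lambda^{m-1}, \quad \text{and} \quad \mathrm{R} (-\Delta)^{\frac{m-1}{2}} = \Lambda^{m-1} \mathrm{R}. \tag{75}$$

Therefore, we have the following.

Corollary 5.8.

$$\mathscr{R}^\dagger_\eta \mathscr{R}_\psi = \mathrm{R}^* \Lambda^{m-1} \mathrm{R} = (-\Delta)^{\frac{m-1}{2}} \mathrm{R}^* \mathrm{R} = \mathrm{R}^* \mathrm{R} (-\Delta)^{\frac{m-1}{2}}. \tag{76}$$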
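As a short worked step, the chain in Corollary 5.8 can be read off from the intertwining relations. The sketch below assumes the relations take the standard form $(-\Delta)^{(m-1)/2} \mathrm{R}^* = \mathrm{R}^* \Lambda^{m-1}$ and $\mathrm{R} (-\Delta)^{(m-1)/2} = \Lambda^{m-1} \mathrm{R}$; this is our reading of (75), not a statement taken verbatim from the text.

```latex
\begin{align*}
\mathrm{R}^* \Lambda^{m-1} \mathrm{R}
  &= \bigl( \mathrm{R}^* \Lambda^{m-1} \bigr) \mathrm{R}
   = (-\Delta)^{\frac{m-1}{2}} \mathrm{R}^* \mathrm{R}
   && \text{(first relation, applied on the left)} \\
\mathrm{R}^* \Lambda^{m-1} \mathrm{R}
  &= \mathrm{R}^* \bigl( \Lambda^{m-1} \mathrm{R} \bigr)
   = \mathrm{R}^* \mathrm{R} (-\Delta)^{\frac{m-1}{2}}
   && \text{(second relation, applied on the right)}
\end{align*}
```

Under these assumptions the filter $\Lambda^{m-1}$ in the Radon domain can be traded for the fractional Laplacian in the real domain, on either side of $\mathrm{R}^* \mathrm{R}$.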

5.3 Extension to L 2 

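By $\langle \cdot, \cdot \rangle$ and $\| \cdot \|_2$, with a slight abuse of notation, we denote the inner product and the norm of both $L^2(\mathbb{R}^m)$ and $L^2(\mathbb{Y}^{m+1})$.
Here we endow $\mathbb{Y}^{m+1}$ with a fixed measure $\alpha^{-m} \, d\alpha \, d\beta \, du$, and omit writing it explicitly as $L^2(\mathbb{Y}^{m+1}; \ldots)$.
We say that $\psi$ is self-admissible if $\psi$ is admissible in itself, i.e. the pair $(\psi, \psi)$ is admissible. The following
relation is immediate by the duality.

Theorem 5.9 (Parseval’s Relation and Plancherel’s Identity). Let $(\psi, \eta) \in \mathcal{S} \times \mathcal{S}'$ be admissible with, for
simplicity, $K_{\psi,\eta} = 1$. For $f, g \in L^1 \cap L^2(\mathbb{R}^m)$,

$$\langle \mathscr{R}_\psi f, \mathscr{R}_\eta g \rangle = \langle \mathscr{R}^\dagger_\eta \mathscr{R}_\psi f, g \rangle = \langle f, g \rangle. \qquad \text{Parseval’s Relation} \tag{77}$$

In particular, if $\psi$ is self-admissible, then

$$\| \mathscr{R}_\psi f \|_2 = \| f \|_2. \qquad \text{Plancherel’s identity} \tag{78}$$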


Recall from Proposition 4.3 that the ridgelet transform is a bounded linear operator on $L^1(\mathbb{R}^m)$. If $\psi \in \mathcal{S}(\mathbb{R})$
is self-admissible, then we can extend the ridgelet transform to $L^2(\mathbb{R}^m)$ by following the bounded extension
procedure [40, 2.2.4]. That is, for $f \in L^2(\mathbb{R}^m)$, take a sequence $f_n \in L^1 \cap L^2(\mathbb{R}^m)$ such that $f_n \to f$ in $L^2$.
Then by Plancherel’s identity,

$$\| f_n - f_m \|_2 = \| \mathscr{R}_\psi f_n - \mathscr{R}_\psi f_m \|_2, \quad \forall n, m \in \mathbb{N}. \tag{79}$$

Since $f_n$ converges in $L^2$, the left-hand side tends to zero as $n, m \to \infty$; hence $\mathscr{R}_\psi f_n$ is a Cauchy sequence in $L^2(\mathbb{Y}^{m+1})$. By the completeness, there uniquely
exists the limit $T_\infty \in L^2(\mathbb{Y}^{m+1})$ of $\mathscr{R}_\psi f_n$. We regard $T_\infty$ as the ridgelet transform of $f$ and define

$$\mathscr{R}_\psi f := T_\infty.$$

Theorem 5.10 (Bounded Extension of Ridgelet Transform on $L^2$). Let $\psi \in \mathcal{S}(\mathbb{R})$ be self-admissible with $K_{\psi,\psi} = 1$.
The ridgelet transform on $L^1 \cap L^2(\mathbb{R}^m)$ admits a unique bounded extension to $L^2(\mathbb{R}^m)$,
satisfying $\| \mathscr{R}_\psi f \|_2 = \| f \|_2$.

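We say that $(\psi, \eta)$ and $(\psi^\star, \eta^\star)$ are equivalent if the two admissible pairs define the
same convolution $\overline{\tilde{\psi}} * \eta = \overline{\widetilde{\psi^\star}} * \eta^\star$. If $(\psi, \eta)$ and $(\psi^\star, \eta^\star)$ are equivalent, then obviously

$$\langle \mathscr{R}_\psi f, \mathscr{R}_\eta g \rangle = \langle \mathscr{R}_{\psi^\star} f, \mathscr{R}_{\eta^\star} g \rangle. \tag{80}$$

We say that an admissible pair $(\psi, \eta)$ is admissibly decomposable when there exist self-admissible pairs
$(\psi^\star, \psi^\star)$ and $(\eta^\star, \eta^\star)$ such that $(\psi^\star, \eta^\star)$ is equivalent to $(\psi, \eta)$. If $(\psi, \eta)$ is admissibly decomposable with
$(\psi^\star, \eta^\star)$, then by the Schwartz inequality

$$\langle \mathscr{R}_\psi f, \mathscr{R}_\eta g \rangle \le \| \mathscr{R}_{\psi^\star} f \|_2 \, \| \mathscr{R}_{\eta^\star} g \|_2. \tag{81}$$

Theorem 5.11 (Reconstruction Formula in $L^2$). Let $f \in L^2(\mathbb{R}^m)$ and $(\psi, \eta) \in \mathcal{S} \times \mathcal{S}'$ be admissibly
decomposable with $K_{\psi,\eta} = 1$. Then,

$$\mathscr{R}^\dagger_\eta \mathscr{R}_\psi f \to f, \quad \text{in } L^2. \tag{82}$$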

The proof is provided in Appendix E. Even when $\psi$ is not self-admissible and thus $\mathscr{R}_\psi$ cannot be defined on
$L^2(\mathbb{R}^m)$, the reconstruction operator $\mathscr{R}^\dagger_\eta \mathscr{R}_\psi$ can be defined with the aid of $\eta$.

6 Neural Network with Unbounded Activation Functions


In this section we instantiate the universal approximation property for the variants of neural networks.
Recall that a neural network coincides with the dual ridgelet transform of a function. Henceforth, we
rephrase a dual ridgelet function as an activation function. According to the reconstruction formulas
(Theorems 5.6, 5.7, and 5.11), we can determine whether a neural network with an activation function $\eta$ is
a universal approximator by checking the admissibility of $\eta$.

Table 1 lists some Lizorkin distributions for potential activation functions. In § 6.1 we verify that they
belong to $\mathcal{S}'_0(\mathbb{R})$ and some of them belong to $\mathcal{O}_M(\mathbb{R})$ and $\mathcal{S}(\mathbb{R})$, which are subspaces of $\mathcal{S}'_0(\mathbb{R})$. In § 6.2 we
show that they are admissible with some ridgelet function $\psi \in \mathcal{S}(\mathbb{R})$; therefore, each of their corresponding
neural networks is a universal approximator.


6.1 Examples of Lizorkin Distributions 

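We proved the class properties by using the following propositions.

Proposition 6.1 (Tempered Distribution $\mathcal{S}'(\mathbb{R})$ [40, Ex. 2.3.5]). Let $g \in L^1_{\mathrm{loc}}(\mathbb{R})$. If $|g(z)| \lesssim (1 + |z|)^k$
for some $k \in \mathbb{N}_0$, then $g \in \mathcal{S}'(\mathbb{R})$.

Proposition 6.2 (Slowly Increasing Function $\mathcal{O}_M(\mathbb{R})$ [40, Def. 2.3.15]). Let $g \in \mathcal{E}(\mathbb{R})$. If for any
$\alpha \in \mathbb{N}_0$, $|\partial^\alpha g(z)| \lesssim (1 + |z|)^{k_\alpha}$ for some $k_\alpha \in \mathbb{N}_0$, then $g \in \mathcal{O}_M(\mathbb{R})$.

Example 6.3. Truncated power functions $z_+^k$ ($k \in \mathbb{N}_0$), which contain the ReLU $z_+$ and the step function
$z_+^0$, belong to $\mathcal{S}'_0(\mathbb{R})$.

Proof. For any $\ell \in \mathbb{N}_0$ there exists a constant $C_\ell$ such that $|\partial^\ell (z_+^k)| \le C_\ell (1 + |z|)^{k - \ell}$. Hence, $z_+^k \in \mathcal{S}'_0(\mathbb{R})$. □

Example 6.4. The sigmoidal function $\sigma(z)$ and the softplus $\sigma^{(-1)}(z)$ belong to $\mathcal{O}_M(\mathbb{R})$. The derivatives
$\sigma^{(k)}(z)$ ($k \in \mathbb{N}$) belong to $\mathcal{S}(\mathbb{R})$. Hyperbolic tangent $\tanh(z)$ belongs to $\mathcal{O}_M(\mathbb{R})$.

The proof is provided in Appendix F.

Example 6.5 ([40, Ex. 2.2.2]). RBF $G(z)$ and their derivatives $G^{(k)}(z)$ belong to $\mathcal{S}(\mathbb{R})$.

Example 6.6 ([40, Ex. 2.3.5]). Dirac’s $\delta(z)$ and their derivatives $\delta^{(k)}(z)$ belong to $\mathcal{S}'(\mathbb{R})$.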


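6.2 $K_{\psi,\eta}$ when $\psi$ is a derivative of the Gaussian


Given an activation function $\eta \in \mathcal{S}'_0(\mathbb{R})$, according to Corollary 5.5 we can construct an admissible ridgelet
function $\psi \in \mathcal{S}(\mathbb{R})$ by letting

$$\psi := \Lambda^m \psi_0, \tag{83}$$

where $\psi_0 \in \mathcal{S}(\mathbb{R})$ satisfies

$$\langle \hat{\eta}, \hat{\psi}_0 \rangle := \int_{\mathbb{R} \setminus \{0\}} \overline{\hat{\psi}_0(\zeta)} \, \hat{\eta}(\zeta) \, d\zeta \neq 0, \pm\infty. \tag{84}$$

Here we consider the case when $\psi_0$ is given by

$$\psi_0 = G^{(\ell)}, \tag{85}$$

for some $\ell \in \mathbb{N}_0$, where $G$ denotes the Gaussian $G(z) := \exp(-z^2/2)$.

The Fourier transform of the Gaussian is given by $\hat{G}(\zeta) = \sqrt{2\pi} \exp(-\zeta^2/2) = \sqrt{2\pi} \, G(\zeta)$. The Hilbert
transform of the Gaussian, which we encounter by computing $\psi = \Lambda^m G$ when $m$ is odd, is given by

$$\mathcal{H} G(z) = \frac{2i}{\sqrt{\pi}} F\!\left( \frac{z}{\sqrt{2}} \right), \tag{86}$$

where $F(z)$ is the Dawson function $F(z) := \exp(-z^2) \int_0^z \exp(w^2) \, dw$.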
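To make the role of the Dawson function concrete, the following sketch evaluates $\psi = \Lambda G$ (the case $m = 1$) in two ways and compares them. It assumes the Fourier convention $\hat{f}(\zeta) = \int f(z) e^{-iz\zeta} dz$ and that $\Lambda$ acts as multiplication by $|\zeta|$ in the Fourier domain; under those assumptions a direct computation gives $\Lambda G(z) = \sqrt{2/\pi}\,\bigl(1 - \sqrt{2}\, z\, F(z/\sqrt{2})\bigr)$. The function names are ours and purely illustrative.

```python
import numpy as np

def dawson(z, n=4001):
    """Dawson function F(z) = exp(-z^2) * int_0^z exp(w^2) dw (trapezoidal rule)."""
    z = float(z)
    w = np.linspace(0.0, z, n)
    g = np.exp(w**2 - z**2)            # integrand already scaled by exp(-z^2)
    h = w[1] - w[0]                    # signed step, so F is odd in z
    return h * (g.sum() - 0.5 * (g[0] + g[-1]))

def lambda_gauss_numeric(z, zeta_max=10.0, n=4001):
    """Lambda G at z by Fourier inversion of |zeta| * G_hat(zeta).

    Assumes G_hat(zeta) = sqrt(2 pi) exp(-zeta^2 / 2); the integrand is even,
    so the inversion reduces to a cosine integral over [0, zeta_max].
    """
    zeta = np.linspace(0.0, zeta_max, n)
    g = zeta * np.sqrt(2 * np.pi) * np.exp(-zeta**2 / 2) * np.cos(z * zeta)
    h = zeta[1] - zeta[0]
    return h * (g.sum() - 0.5 * (g[0] + g[-1])) / np.pi

def lambda_gauss_closed(z):
    """Closed form of Lambda G in terms of the Dawson function (our derivation)."""
    return np.sqrt(2 / np.pi) * (1.0 - np.sqrt(2.0) * z * dawson(z / np.sqrt(2.0)))
```

For example, comparing `lambda_gauss_numeric(z)` with `lambda_gauss_closed(z)` at a few points such as `z = 0.5, 1.0, 2.0` should give agreement up to quadrature error.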

Example 6.7. $\eta(z) = z_+^k$ ($k \in \mathbb{N}_0$) is admissible with $\psi = \Lambda^m G^{(\ell + k + 1)}$ ($\ell \in \mathbb{N}_0$) iff $\ell$ is even. If $\ell$ is odd, then
the integral in (84) vanishes.

Proof. It follows from the fact that, according to Gel’fand and Shilov (§ 9.3),

$$\widehat{z_+^k}(\zeta) = \frac{k!}{(i\zeta)^{k+1}} + \pi i^k \delta^{(k)}(\zeta), \quad k \in \mathbb{N}_0. \qquad \square$$

Example 6.8. $\eta(z) = \delta^{(k)}(z)$ ($k \in \mathbb{N}_0$) is admissible with $\psi = \Lambda^m G$ iff $k$ is even. If $k$ is odd, then the integral in (84) vanishes.

In contrast to polynomial functions, Dirac’s $\delta$ can be an admissible activation function.

Example 6.9. $\eta(z) = G^{(k)}(z)$ ($k \in \mathbb{N}_0$) is admissible with $\psi = \Lambda^m G$ iff $k$ is even. If $k$ is odd, then the integral in (84) vanishes.

Example 6.10. $\eta(z) = \sigma^{(k)}(z)$ ($k \in \mathbb{N}_0$) is admissible with $\psi = \Lambda^m G$ iff $k$ is odd. If $k \ge 2$ is even, then the integral in (84) vanishes.
The softplus $\sigma^{(-1)}$ is admissible with $\psi = \Lambda^m G''$.

The proof is provided in Appendix F.
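The parity pattern in Example 6.9 can be checked numerically. The sketch below is illustrative only: it assumes the Fourier convention $\hat{f}(\zeta) = \int f(z) e^{-iz\zeta} dz$, so that $\widehat{G^{(j)}}(\zeta) = (i\zeta)^j \sqrt{2\pi} \exp(-\zeta^2/2)$, and evaluates the integral in (84) for $\psi_0 = G^{(\ell)}$ and $\eta = G^{(k)}$ by the trapezoidal rule. The function name is hypothetical.

```python
import numpy as np

def pairing_gaussian_derivatives(ell, k, zeta_max=12.0, n=24001):
    """Integral of conj(FT[G^(ell)]) * FT[G^(k)] over a symmetric zeta grid."""
    zeta = np.linspace(-zeta_max, zeta_max, n)
    g_hat = np.sqrt(2 * np.pi) * np.exp(-zeta**2 / 2)
    # (i zeta)^j written as i^j * zeta^j so that zeta = 0 with j = 0 gives 1
    psi0_hat = (1j) ** ell * zeta**ell * g_hat
    eta_hat = (1j) ** k * zeta**k * g_hat
    integrand = np.conj(psi0_hat) * eta_hat
    h = zeta[1] - zeta[0]
    return h * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
```

With `ell = 0`, the value for `k = 0` is close to $2\pi\sqrt{\pi}$, the value for `k = 1` is numerically zero because the integrand is odd, and the value for `k = 2` is close to $-\pi\sqrt{\pi}$, matching the "admissible iff $k$ is even" statement.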


7 Numerical Examples of Reconstruction 


We performed some numerical experiments on reconstructing a one-dimensional signal and a two-dimensional
image, with reference to our theoretical diagnoses for admissibility in the previous section. Table 5 lists the
diagnoses of $(\Lambda^m \psi_0, \eta)$ we employ in this section. The symbols ’+’, ’0’, and ’∞’ in each cell indicate that
$K_{\psi,\eta}$ of the corresponding $(\psi, \eta)$ converges to a non-zero constant (+), converges to zero (0), and diverges
(∞). Hence, by Theorem 5.6, if the cell $(\psi, \eta)$ indicates ’+’ then a neural network with an activation
function $\eta$ is a universal approximator.


Table 5: Theoretical diagnoses for admissibility of $\psi = \Lambda^m \psi_0$ and $\eta$. ’+’ indicates that $(\psi, \eta)$ is admissible.
’0’ and ’∞’ indicate that $K_{\psi,\eta}$ vanishes and diverges, respectively, and thus $(\psi, \eta)$ is not admissible.


activation function            $\eta$             $\psi = \Lambda^m G$   $\psi = \Lambda^m G'$   $\psi = \Lambda^m G''$
derivative of sigmoidal ft.    $\sigma'$          +                      0                       +
sigmoidal function             $\sigma$           ∞                      +                       0
softplus                       $\sigma^{(-1)}$    ∞                      ∞                       +
Dirac’s $\delta$               $\delta$           +                      0                       +
unit step function             $z_+^0$            ∞                      +                       0
ReLU                           $z_+$              ∞                      ∞                       +
linear function                $z$                0                      0                       0
RBF                            $G$                +                      0                       +
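The rows of Table 5 for the truncated power functions can be reproduced by a simple exponent count. Away from the origin the Fourier transform of $z_+^k$ behaves like $\zeta^{-(k+1)}$ and that of $G^{(\ell)}$ like $\zeta^{\ell} \exp(-\zeta^2/2)$, so the integrand of (84) is proportional to $\zeta^{\ell - k - 1} \exp(-\zeta^2/2)$. The sketch below encodes that rule; it is our own illustrative helper, and it reads "diverges" as failure of absolute convergence at the origin.

```python
def diagnose_truncated_power(k, ell):
    """Diagnose eta = z_+^k against psi_0 = G^(ell) by exponent counting.

    Returns "inf" when the integrand is not absolutely integrable at the origin,
    "0" when it is an odd integrable function, and "+" when it is even and integrable.
    """
    exponent = ell - k - 1          # power of zeta in the integrand away from 0
    if exponent < 0:
        return "inf"
    return "+" if exponent % 2 == 0 else "0"

# Rows of Table 5 for the unit step (k = 0) and the ReLU (k = 1), ell = 0, 1, 2
step_row = [diagnose_truncated_power(0, ell) for ell in range(3)]
relu_row = [diagnose_truncated_power(1, ell) for ell in range(3)]
```

The same rule restates Example 6.7: with $\psi_0 = G^{(\ell + k + 1)}$ the exponent is $\ell$, which gives a non-zero value exactly when $\ell$ is even.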


7.1 Sinusoidal Curve 

We studied a one-dimensional signal $f(x) = \sin 2\pi x$ defined on $[-1, 1]$. The ridgelet functions
$\psi = \Lambda \psi_0$ were chosen from derivatives of the Gaussian $\psi_0 = G^{(\ell)}$, ($\ell = 0, 1, 2$). The activation functions
$\eta$ were chosen from among the softplus $\sigma^{(-1)}$, the sigmoidal function $\sigma$ and its derivative $\sigma'$, the ReLU
$z_+$, unit step function $z_+^0$, and Dirac’s $\delta$. In addition, we examined the case when the activation function
is simply a linear function: $\eta(z) = z$, which cannot be admissible because the Fourier transform of
polynomials is supported at the origin in the Fourier domain.

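The signal was sampled from $[-1, 1]$ with $\Delta x = 1/100$. We computed the reconstruction formula

$$\int_{\mathbb{R}} \int_{\mathbb{R}} \mathscr{R}_\psi f(a, b) \, \eta(ax - b) \, \frac{da \, db}{|a|} \tag{87}$$

by simply discretizing $(a, b) \in [-30, 30] \times [-30, 30]$ by $\Delta a = \Delta b = 1/10$. That is,

$$\mathscr{R}_\psi f(a, b) \approx \sum_{n=0}^{N} f(x_n) \, \overline{\psi(a \cdot x_n - b)} \, |a| \, \Delta x, \qquad x_n = x_0 + n \Delta x \tag{88}$$

$$\mathscr{R}^\dagger_\eta \mathscr{R}_\psi f(x) \approx \sum_{(i,j)=(0,0)}^{(I,J)} \mathscr{R}_\psi f(a_i, b_j) \, \eta(a_i \cdot x - b_j) \, \frac{\Delta a \, \Delta b}{|a_i|}, \qquad a_i = a_0 + i \Delta a, \quad b_j = b_0 + j \Delta b \tag{89}$$

where $x_0 = -1$, $a_0 = -30$, $b_0 = -30$, and $N = 200$, $(I, J) = (600, 600)$.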
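The discretization in (88) and (89) can be sketched in a few lines of NumPy. The example below is not the authors' code. It uses the pair $\psi = \Lambda G$, $\eta = G$ (marked ’+’ in Table 5) because both functions decay, tabulates $\psi$ by numerical Fourier inversion under the assumption that $\Lambda$ multiplies by $|\zeta|$ with $\hat{G}(\zeta) = \sqrt{2\pi}\exp(-\zeta^2/2)$, and uses the fact that the factor $|a|$ in (88) cancels the factor $1/|a|$ in (89). All function names and default parameters are ours.

```python
import numpy as np

def gaussian(z):
    return np.exp(-z**2 / 2.0)

def tabulate_psi(z_max=70.0, dz=0.05, zeta_max=8.0, n_zeta=801):
    """Tabulate psi = Lambda G on a uniform grid by a cosine integral in zeta."""
    zeta = np.linspace(0.0, zeta_max, n_zeta)
    weight = zeta * np.sqrt(2 * np.pi) * np.exp(-zeta**2 / 2)   # |zeta| * G_hat
    z = np.arange(-z_max, z_max + dz, dz)
    integrand = weight[None, :] * np.cos(np.outer(z, zeta))
    h = zeta[1] - zeta[0]
    psi = h * (integrand.sum(axis=1)
               - 0.5 * (integrand[:, 0] + integrand[:, -1])) / np.pi
    return z, psi

def reconstruct_sine(da=0.1, db=0.1, dx=0.01, a_max=30.0, b_max=30.0):
    """Discretized reconstruction of f(x) = sin(2 pi x) on [-1, 1]."""
    x = np.arange(-1.0, 1.0 + dx / 2, dx)            # x_n = x_0 + n * dx
    f = np.sin(2 * np.pi * x)
    a = np.arange(-a_max, a_max + da / 2, da)        # a_i = a_0 + i * da
    b = np.arange(-b_max, b_max + db / 2, db)        # b_j = b_0 + j * db
    z_tab, psi_tab = tabulate_psi(z_max=a_max + b_max + 10.0)
    recon = np.zeros_like(x)
    for ai in a:
        # (88) without |a|: it cancels against 1/|a| in (89)
        psi_vals = np.interp(ai * x[None, :] - b[:, None], z_tab, psi_tab)
        T = psi_vals @ f * dx                        # ridgelet coefficients over b
        eta_vals = gaussian(ai * x[:, None] - b[None, :])
        recon += eta_vals @ T * da * db              # (89), summed over b for this a
    return x, f, recon
```

Calling `x, f, recon = reconstruct_sine()` should return a curve that, away from the endpoints of $[-1, 1]$, is close to a positive constant multiple of `f`; the constant plays the role of $K_{\psi,\eta}$ and depends on the assumed conventions and on the truncation of the $(a, b)$ grid.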


Figure 2: Ridgelet transform $\mathscr{R}_\psi f(a, b)$ of $f(x) = \sin 2\pi x$ defined on $[-1, 1]$ with respect to $\psi$. Panels: $\psi = \Lambda G$, $\psi = \Lambda G'$, $\psi = \Lambda G''$; horizontal axis: $b$.

Figure 2 depicts the ridgelet transform $\mathscr{R}_\psi f(a, b)$. As the order $\ell$ of $\psi = \Lambda G^{(\ell)}$ increases, the localization
of $\mathscr{R}_\psi f$ increases. As shown in Figure 3, every $\mathscr{R}_\psi f$ can be reconstructed to $f$ with some admissible
activation function $\eta$. It is somewhat intriguing that the case $\psi = \Lambda G''$ can be reconstructed with two
different activation functions.

Figures 3, 4, and 5 tile the results of reconstruction with sigmoidal functions, truncated power functions,
and a linear function. The solid line is a plot of the reconstruction result; the dotted line draws the original
signal. In each of the figures, the theoretical diagnoses and experimental results are almost consistent and
reasonable.

In Figure 3, at the bottom left, the signal reconstructed with the softplus seems incompletely
reconstructed, in spite of Table 5 indicating ’∞’. Recall that $\widehat{\sigma^{(-1)}}(\zeta)$ has a pole $\zeta^{-2}$; thus, we can understand
this cell in terms of $\sigma^{(-1)} * \Lambda G$ working as an integrator, that is, a low-pass filter.

In Figure 4, in the top row, all the reconstructions with Dirac’s $\delta$ fail. These results seem to contradict
the theory. However, this simply reflects the implementation difficulty of realizing Dirac’s $\delta$, because $\delta(z)$ is
a “function” that is almost constantly zero, except for the origin, whereas $z = ax - b$ rarely happens
to be exactly zero once $a$, $b$, and $x$ are discretized. This is the reason why this row fails. At the
bottom left, the ReLU seems to lack sharpness for reconstruction. Here we can again understand that
$z_+ * \Lambda G$ worked as a low-pass filter. It is worth noting that the unit step function and the ReLU provide
a sharper reconstruction than the sigmoidal function and the softplus.

In Figure 5, all the reconstructions with a linear function fail. This is consistent with the theory that
polynomials cannot be admissible as their Fourier transforms are singular at the origin.

Figure 4: Reconstruction with truncated power functions: Dirac’s $\delta$, unit step $z_+^0$, and ReLU $z_+$. The
solid line is a plot of the reconstruction result; the dotted line plots the original signal.



Figure 5: Reconstruction with linear function $\eta(z) = z$. The solid line is a plot of the reconstruction result;
the dotted line plots the original signal.




























7.2 Shepp-Logan phantom 

We next studied a gray-scale image, the Shepp-Logan phantom [42]. The ridgelet functions $\psi = \Lambda^2 \psi_0$ were
chosen from the $\ell$-th derivatives of the Gaussian $\psi_0 = G^{(\ell)}$, ($\ell = 0, 1, 2$). The activation functions $\eta$ were
chosen from the RBF $G$ (instead of Dirac’s $\delta$), the unit step function $z_+^0$, and the ReLU $z_+$.
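For $m = 2$ the ridgelet functions have elementary closed forms. Assuming that $\Lambda^2$ acts as multiplication by $|\zeta|^2$ in the Fourier domain, it coincides with $-d^2/dz^2$, so $\Lambda^2 G^{(\ell)} = -G^{(\ell+2)}$. The sketch below writes out these functions for $\ell = 0, 1, 2$; the helper names are ours.

```python
import numpy as np

def gaussian_derivative(z, order):
    """G^(order)(z) for G(z) = exp(-z^2 / 2), closed forms for orders 0..4."""
    z = np.asarray(z, dtype=float)
    g = np.exp(-z**2 / 2)
    polys = {
        0: np.ones_like(z),
        1: -z,
        2: z**2 - 1,
        3: -z**3 + 3 * z,
        4: z**4 - 6 * z**2 + 3,
    }
    return polys[order] * g

def ridgelet_2d(z, ell):
    """psi = Lambda^2 G^(ell) = -G^(ell + 2), under the assumption Lambda^2 = -d^2/dz^2."""
    return -gaussian_derivative(z, ell + 2)
```

For instance, `ridgelet_2d(z, 0)` is $(1 - z^2)\exp(-z^2/2)$, the familiar "Mexican hat" profile.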

The original image was composed of $256 \times 256$ pixels. We treated it as a two-dimensional signal $f(x)$
defined on $[-1, 1]^2$. We computed the reconstruction formula

$$\int_{\mathbb{R}} \int_{\mathbb{R}^2} \mathscr{R}_\psi f(a, b) \, \eta(a \cdot x - b) \, \frac{da \, db}{\|a\|}, \tag{90}$$

by discretizing $(a, b) \in [-300, 300]^2 \times [-30, 30]$ by $\Delta a = (1, 1)$ and $\Delta b = 1$.

Figure 6 lists the results of the reconstruction. As observed in the one-dimensional case, the results
are fairly consistent with the theory. Again, at the bottom left, the reconstructed image seems dim. Our
understanding is that it was caused by low-pass filtering.



Figure 6: Reconstruction with RBF $G$, unit step $z_+^0$, and ReLU $z_+$.

8 Concluding Remarks


We have shown that neural networks with unbounded non-polynomial activation functions have the
universal approximation property. Because the integral representation of the neural network coincides with
the dual ridgelet transform, our goal reduces to constructing the ridgelet transform with respect to
distributions. Our results cover a wide range of activation functions: not only the traditional RBF, sigmoidal
function, and unit step function, but also truncated power functions $z_+^k$, which contain the ReLU, and even
Dirac’s $\delta$. In particular, we concluded that a neural network can approximate $L^1 \cap C^0$ functions in the
pointwise sense, and $L^2$ functions in the $L^2$ sense, when its activation “function” is a Lizorkin distribution
($\mathcal{S}'_0$) that is admissible. The Lizorkin distribution is a tempered distribution ($\mathcal{S}'$) that is not a polynomial.
As an important consequence, what a neural network learns is a ridgelet transform of the target function $f$.
In other words, during backpropagation the network indirectly searches for an admissible ridgelet function,
by constructing a backprojection filter.

Using the weak form expression of the ridgelet transform, we extended the definition of the ridgelet transform
to distributions. Theorem 4.2 guarantees the existence of the ridgelet transform with respect
to distributions. Table 3 suggests that for the convolution of distributions to converge, the class $\mathcal{X}$
of domain and the class $\mathcal{Z}$ of ridgelets should be balanced. Proposition 4.3 states that $\mathscr{R}_\psi : L^1(\mathbb{R}^m) \to \mathcal{S}'(\mathbb{Y}^{m+1})$
is a bounded linear operator. Theorem 4.5 states that the dual ridgelet transform coincides with
a dual operator. Provided the reconstruction formula holds, that is, when the ridgelets are admissible, the
ridgelet transform is injective and the dual ridgelet transform is surjective.

For an unbounded $\eta \in \mathcal{Z}(\mathbb{R})$ to be admissible, it cannot be a polynomial and it can be associated with
a backprojection filter. If $\eta \in \mathcal{Z}(\mathbb{R})$ is a polynomial then the product of distributions in the admissibility
condition should be indeterminate. Therefore, $\mathcal{Z}(\mathbb{R})$ excludes polynomials. Theorem 5.4 rephrases the
admissibility condition in the real domain. As a direct consequence, Corollary 5.5 gives a constructive
sufficient condition for admissibility.

After investigating the construction of the admissibility condition, we showed reconstruction formulas
on $L^1(\mathbb{R}^m)$ in two ways. Theorem 5.6 uses the Fourier slice theorem. Theorem 5.7 uses
approximations to the identity and reduces to the inversion formula of the Radon transform. Theorem 5.7
as well as Corollary 5.8 suggest that the admissibility condition requires $(\psi, \eta)$ to construct a backprojection
filter.

In addition, we have extended the ridgelet transform on $L^1(\mathbb{R}^m)$ to $L^2(\mathbb{R}^m)$. Theorem 5.9 states
that Parseval’s relation, which is a weak version of the reconstruction formula, holds on $L^1 \cap L^2(\mathbb{R}^m)$.
Theorem 5.10 follows the bounded extension of $\mathscr{R}_\psi$ from $L^1 \cap L^2(\mathbb{R}^m)$ to $L^2(\mathbb{R}^m)$. Theorem 5.11 gives the
reconstruction formula in $L^2(\mathbb{R}^m)$.

By showing that $z_+$ and other activation functions belong to $\mathcal{S}'_0$, and that they are admissible with some
derivatives of the Gaussian, we proved the universal approximation property of a neural network with an
unbounded activation function. Numerical examples were consistent with our theoretical diagnoses on the
admissibility. In addition, we found that some non-admissible combinations worked as a low-pass filter;
for example, $(\psi, \eta) = (\Lambda^m[\text{Gaussian}], \text{ReLU})$ and $(\psi, \eta) = (\Lambda^m[\text{Gaussian}], \text{softplus})$.

We plan to perform the following interesting investigations in future. 

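1. Given an activation function $\eta \in \mathcal{S}'_0(\mathbb{R})$, which is the “best” ridgelet function $\psi \in \mathcal{S}(\mathbb{R})$?


In fact, for a given activation function $\eta$, we have plenty of choices. By Corollary 5.5, all
elements of

$$\mathcal{A}_\eta := \left\{ \Lambda^m \psi_0 \;\middle|\; \psi_0 \in \mathcal{S}(\mathbb{R}) \text{ such that } \langle \hat{\eta}, \hat{\psi}_0 \rangle \text{ is finite and nonzero} \right\}, \tag{91}$$

are admissible with $\eta$.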


2. How are ridgelet functions related to deep neural networks? 


Because ridgelet analysis is so fruitful, we aim to develop “deep” ridgelet analysis. One of the
essential leaps from shallow to deep is that the network output expands from scalar to vector
because a deep structure is a cascade of multi-input multi-output layers. In this regard, we
expect Corollary 5.8 to play a key role. By using the intertwining relations, we can “cascade”
the reconstruction operators as below

= R*A fc - 1 R(-A) m " 1 -^R*A £ - 1 R. (0 ^k,£^m) (92) 

This equation suggests that the cascade of ridgelet transforms coincides with a composite of
backprojection filtering in the Radon domain and differentiation in the real domain. We
conjecture that this point of view can facilitate analysis of the deep structure.


A Proof of Theorem 4.2


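A ridgelet transform $\mathscr{R}_\psi f(u, \alpha, \beta)$ is the convolution of a Radon transform $\mathrm{R} f(u, p)$ and a dilated
distribution $\psi_\alpha(p)$ in the sense of a Schwartz distribution. That is,

$$f(x) \mapsto \mathrm{R} f(u, p) \mapsto \left( \mathrm{R} f(u, \cdot) * \overline{\widetilde{\psi_\alpha}} \right)(\beta) = \mathscr{R}_\psi f(u, \alpha, \beta). \tag{93}$$

We verify that the ridgelet transform is well defined in a stepwise manner. Provided there is no danger of
confusion, in the following steps we denote by $\mathcal{X}$ the classes $\mathcal{D}$, $\mathcal{S}$, $\mathcal{O}'_C$, $L^1$, or $\mathcal{D}'_{L^1}$.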

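Step 1: Class $\mathcal{X}(\mathbb{S}^{m-1} \times \mathbb{R})$ of $\mathrm{R} f(u, p)$

Hertle’s results [39, Th 4.6, Cor 4.8] show that the Radon transform is the continuous injection

$$\mathrm{R} : \mathcal{X}(\mathbb{R}^m) \hookrightarrow \mathcal{X}(\mathbb{S}^{m-1} \times \mathbb{R}), \tag{94}$$

where $\mathcal{X} = \mathcal{D}$, $\mathcal{S}$, $\mathcal{O}'_C$, $L^1$, or $\mathcal{D}'_{L^1}$; if $f \in \mathcal{X}(\mathbb{R}^m)$ then $\mathrm{R} f \in \mathcal{X}(\mathbb{S}^{m-1} \times \mathbb{R})$, which determines the second
column. Our possible choice of the domain $\mathcal{X}$ is restricted to them.

Step 2: Class $\mathcal{B}(\mathbb{R})$ of $\mathscr{R}_\psi f(u, \alpha, \beta)$ with respect to $\beta$

Fix $\alpha > 0$. Recall that $\mathscr{R}_\psi f(u, \alpha, \beta) = \left( \mathrm{R} f(u, \cdot) * \overline{\widetilde{\psi_\alpha}} \right)(\beta)$ in the sense of Schwartz distributions. By the
nuclearity of $\mathcal{X}$ [33, § 51], the kernel theorem

$$\mathcal{X}(\mathbb{S}^{m-1} \times \mathbb{R}) \cong \mathcal{X}(\mathbb{S}^{m-1}) \,\hat{\otimes}\, \mathcal{X}(\mathbb{R}) \tag{95}$$

holds. Therefore, we can omit $u \in \mathbb{S}^{m-1}$ in the considerations for $(\alpha, \beta) \in \mathbb{H}$. According to Schwartz’s
results shown in Table 3, for the convolution $g * \psi$ of $g \in \mathcal{X}(\mathbb{R})$ and $\psi \in \mathcal{Z}(\mathbb{R})$ to converge in $\mathcal{D}'(\mathbb{R})$,
we can assign the largest possible class $\mathcal{Z}$ for each $\mathcal{X}$ as in the third column. Note that for $\mathcal{X} = L^1$
we even assumed the continuity $\mathcal{Z} = L^p \cap C^0$, which is technically required in Step 3. Obviously for
$\mathcal{Z} = \mathcal{D}'$, $\mathcal{S}'$, $L^p \cap C^0$, or $\mathcal{D}'_{L^p}$, if $\psi \in \mathcal{Z}(\mathbb{R})$ then $\psi_\alpha \in \mathcal{Z}(\mathbb{R})$. Therefore, we can determine the fourth column
by evaluating $\mathcal{X} * \mathcal{Z}$ according to Table 3.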

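Step 3: Class $\mathcal{A}(\mathbb{H})$ of $\mathscr{R}_\psi f(u, \alpha, \beta)$ with respect to $(\alpha, \beta)$

Fix $u_0 \in \mathbb{S}^{m-1}$ and assume $f \in \mathcal{X}(\mathbb{R}^m)$. Write $g(p) := \mathrm{R} f(u_0, p)$ and

$$\mathcal{W}[\psi; g](\alpha, \beta) := \int_{\mathbb{R}} g(\alpha z + \beta) \, \psi(z) \, dz, \tag{96}$$

then $\mathscr{R}_\psi f(u_0, \alpha, \beta) = \mathcal{W}[\psi; g](\alpha, \beta)$ for every $(\alpha, \beta) \in \mathbb{H}$. By the kernel theorem, $g \in \mathcal{X}(\mathbb{R})$.

Case 3a: ($\mathcal{X} = \mathcal{D}$ and $\mathcal{Z} = \mathcal{D}'$ then $\mathcal{B} = \mathcal{E}$ and $\mathcal{A} = \mathcal{E}$)

We begin by considering the case in the first row. Observe that

$$\partial_\alpha \mathcal{W}[\psi; g](\alpha, \beta) = \partial_\alpha \int_{\mathbb{R}} g(\alpha z + \beta) \psi(z) \, dz = \int_{\mathbb{R}} g'(\alpha z + \beta) \, z \cdot \psi(z) \, dz = \mathcal{W}[z \cdot \psi; g'](\alpha, \beta), \tag{97}$$

$$\partial_\beta \mathcal{W}[\psi; g](\alpha, \beta) = \partial_\beta \int_{\mathbb{R}} g(\alpha z + \beta) \psi(z) \, dz = \int_{\mathbb{R}} g'(\alpha z + \beta) \psi(z) \, dz = \mathcal{W}[\psi; g'](\alpha, \beta), \tag{98}$$

and thus that for every $k, \ell \in \mathbb{N}_0$,

$$\partial_\alpha^k \partial_\beta^\ell \mathcal{W}[\psi; g](\alpha, \beta) = \mathcal{W}[z^k \cdot \psi; g^{(k+\ell)}](\alpha, \beta). \tag{99}$$

Obviously if $g \in \mathcal{D}(\mathbb{R})$ and $\psi \in \mathcal{D}'(\mathbb{R})$ then $g^{(k+\ell)} \in \mathcal{D}(\mathbb{R})$ and $z^k \cdot \psi \in \mathcal{D}'(\mathbb{R})$, respectively, and thus
$\partial_\alpha^k \partial_\beta^\ell \mathcal{W}[\psi; g](\alpha, \beta)$ exists at every $(\alpha, \beta) \in \mathbb{H}$. Therefore, we can conclude that if $g \in \mathcal{D}(\mathbb{R})$ and $\psi \in \mathcal{D}'(\mathbb{R})$
then $\mathcal{W}[\psi; g] \in \mathcal{E}(\mathbb{H})$.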
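The differentiation rules (97) and (98) are easy to check numerically for smooth, rapidly decaying choices of $g$ and $\psi$. The sketch below is only an illustration: it takes $g(p) = \exp(-p^2/2)$ and $\psi(z) = (1 - z^2)\exp(-z^2/2)$ (both our choices), evaluates $\mathcal{W}[\psi; g]$ by the trapezoidal rule, and exposes the pieces needed to compare a central difference in $\alpha$ or $\beta$ with the right-hand sides of (97) and (98).

```python
import numpy as np

Z = np.linspace(-12.0, 12.0, 24001)   # quadrature grid in z
H = Z[1] - Z[0]

def W(psi, g, alpha, beta):
    """W[psi; g](alpha, beta) = int g(alpha z + beta) psi(z) dz (trapezoidal rule)."""
    vals = g(alpha * Z + beta) * psi(Z)
    return H * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

def g(p):
    return np.exp(-p**2 / 2)

def dg(p):
    return -p * np.exp(-p**2 / 2)      # g'

def psi(z):
    return (1 - z**2) * np.exp(-z**2 / 2)

def zpsi(z):
    return z * psi(z)                  # z * psi, as in (97)

def central_difference(fun, t, eps=1e-5):
    return (fun(t + eps) - fun(t - eps)) / (2 * eps)
```

For example, `central_difference(lambda a: W(psi, g, a, 0.4), 1.3)` should agree with `W(zpsi, dg, 1.3, 0.4)`, and the analogous difference in $\beta$ with `W(psi, dg, 1.3, 0.4)`.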


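Case 3b: ($\mathcal{X} = \mathcal{E}'$ and $\mathcal{Z} = \mathcal{D}'$ then $\mathcal{B} = \mathcal{D}'$ and $\mathcal{A} = \mathcal{D}'$)

Let $g \in \mathcal{E}'(\mathbb{R})$ and $\psi \in \mathcal{D}'(\mathbb{R})$. We show that $\mathcal{W}[\psi; g] \in \mathcal{D}'(\mathbb{H})$, that is, for every compact set $K \subset \mathbb{H}$, there
exists $N \in \mathbb{N}_0$ such that

$$\left| \int_K T(\alpha, \beta) \, \mathcal{W}[\psi; g](\alpha, \beta) \, \frac{d\alpha \, d\beta}{\alpha} \right| \lesssim \sum_{k, \ell \le N} \sup_{(\alpha, \beta) \in K} \left| \partial_\alpha^k \partial_\beta^\ell T(\alpha, \beta) \right|, \quad \forall T \in \mathcal{D}(K). \tag{100}$$

Fix an arbitrary compact set $K \subset \mathbb{H}$ and a smooth function $T \in \mathcal{D}(K)$, which is supported in $K$. Take two
compact sets $A \subset \mathbb{R}_+$ and $B \subset \mathbb{R}$ such that $K \subset A \times B$. By the assumption that $g \in \mathcal{E}'(\mathbb{R})$ and $\psi \in \mathcal{D}'(\mathbb{R})$,
there exist $k, \ell \in \mathbb{N}_0$ such that

$$\left| \int_{\mathbb{R}} u(z) g(z) \, dz \right| \lesssim \sup_{z \in \operatorname{supp} g} \left| u^{(k)}(z) \right|, \quad \forall u \in \mathcal{E}(\mathbb{R}) \tag{101}$$

$$\left| \int_{\mathbb{R}} v(z) \psi(z) \, dz \right| \lesssim \sup_{z \in \mathbb{R}} \left| v^{(\ell)}(z) \right|, \quad \forall v \in \mathcal{D}(B). \tag{102}$$

Observe that for every fixed $\alpha$, $T(\alpha, \cdot) * g \in \mathcal{D}(\mathbb{R})$. Then, by applying (101) and (102) incrementally,

$$\left| \int_{\mathbb{R}} \int_{\mathbb{R}} T(\alpha, \beta) \int_{\mathbb{R}} g(\alpha z + \beta) \psi(z) \, dz \, \frac{d\alpha \, d\beta}{\alpha} \right| = \left| \int_0^\infty \int_{\mathbb{R}} \int_{\mathbb{R}} T(\alpha, \beta - \alpha z) \psi(z) \, dz \cdot g(\beta) \, d\beta \, \frac{d\alpha}{\alpha} \right| \tag{103}$$

$$\lesssim \int_0^\infty \sup_{\beta \in \operatorname{supp} g} \left| \partial_\beta^k \int_{\mathbb{R}} T(\alpha, \beta - \alpha z) \psi(z) \, dz \right| \frac{d\alpha}{\alpha} \tag{104}$$

$$\lesssim \int_0^\infty \sup_{\beta \in \operatorname{supp} g} \sup_{z} \left| \partial_z^\ell \partial_\beta^k T(\alpha, \beta - \alpha z) \right| \frac{d\alpha}{\alpha} \tag{105}$$

$$\lesssim \int_A \sup_{\beta \in B} \left| \partial_\beta^{k+\ell} T(\alpha, \beta) \right| \alpha^{\ell - 1} \, d\alpha \tag{106}$$

$$\lesssim \sup_{(\alpha, \beta) \in K} \left| \partial_\beta^{k+\ell} T(\alpha, \beta) \right| \cdot \int_A \alpha^{\ell - 1} \, d\alpha, \tag{107}$$

where the third inequality follows by repeatedly applying $\partial_z [T(\alpha, \beta - \alpha z)] = (-\alpha) \partial_\beta T(\alpha, \beta - \alpha z)$; the
fourth inequality follows by the compactness of the support of $T$. Thus, we conclude that $\mathcal{W}[\psi; g] \in \mathcal{D}'(\mathbb{H})$.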


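Case 3c: ($\mathcal{X} = \mathcal{S}$ and $\mathcal{Z} = \mathcal{S}'$ then $\mathcal{B} = \mathcal{O}_M$ and $\mathcal{A} = \mathcal{O}_M$)

Let $g \in \mathcal{S}(\mathbb{R})$ and $\psi \in \mathcal{S}'(\mathbb{R})$. Recall the case when $\mathcal{X} = \mathcal{D}$. Obviously, for every $k, \ell \in \mathbb{N}_0$, $g^{(k+\ell)} \in \mathcal{S}(\mathbb{R})$
and $z^k \cdot \psi \in \mathcal{S}'(\mathbb{R})$, respectively, which implies $\mathcal{W}[\psi; g] \in \mathcal{E}(\mathbb{H})$. Now we even show that $\mathcal{W}[\psi; g] \in \mathcal{O}_M(\mathbb{H})$,
that is, for every $k, \ell \in \mathbb{N}_0$ there exist $s, t \in \mathbb{N}_0$ such that

$$\left| \partial_\alpha^k \partial_\beta^\ell \mathcal{W}[\psi; g](\alpha, \beta) \right| \lesssim (\alpha + 1/\alpha)^s (1 + \beta^2)^{t/2}. \tag{108}$$

Recall that by (99), we can regard $\partial_\alpha^k \partial_\beta^\ell \mathcal{W}[\psi; g](\alpha, \beta)$ as $\partial_\alpha^0 \partial_\beta^0 \mathcal{W}[\psi_0; g_0](\alpha, \beta)$, by setting $g_0 := g^{(k+\ell)} \in \mathcal{S}(\mathbb{R})$
and $\psi_0 := z^k \cdot \psi \in \mathcal{S}'(\mathbb{R})$. Henceforth we focus on the case when $k = \ell = 0$. Since $\psi \in \mathcal{S}'(\mathbb{R})$, there
exists $N \in \mathbb{N}_0$ such that

$$\left| \int_{\mathbb{R}} u(z) \psi(z) \, dz \right| \lesssim \sum_{s, t \le N} \sup_{z \in \mathbb{R}} \left| z^s u^{(t)}(z) \right|, \quad \forall u \in \mathcal{S}(\mathbb{R}). \tag{109}$$

By substituting $u(z) \leftarrow g(\alpha z + \beta)$, we have

$$\left| \int_{\mathbb{R}} g(\alpha z + \beta) \psi(z) \, dz \right| \lesssim \sum_{s, t \le N} \sup_{z \in \mathbb{R}} \left| z^s \partial_z^t g(\alpha z + \beta) \right| \tag{110}$$

$$= \sum_{s, t \le N} \sup_{p \in \mathbb{R}} \left| \left( \frac{p - \beta}{\alpha} \right)^s \alpha^t g^{(t)}(p) \right| \tag{111}$$

$$\lesssim \sum_{s, t \le N} \alpha^{t - s} \beta^s \sup_{p \in \mathbb{R}} \left| p^s g^{(t)}(p) \right| \tag{112}$$

$$\lesssim (\alpha + 1/\alpha)^N (1 + \beta^2)^{N/2}, \tag{113}$$

where the second equation follows by substituting $p \leftarrow \alpha z + \beta$; the fourth inequality follows because every
$\sup_p |p^s g^{(t)}(p)|$ is finite by the assumption that $g \in \mathcal{S}(\mathbb{R})$. Therefore, we can conclude that if $g \in \mathcal{S}(\mathbb{R})$ and
$\psi \in \mathcal{S}'(\mathbb{R})$ then $\mathcal{W}[\psi; g] \in \mathcal{O}_M(\mathbb{H})$.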


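Case 3d: ($\mathcal{X} = \mathcal{O}'_C$ and $\mathcal{Z} = \mathcal{S}'$ then $\mathcal{B} = \mathcal{S}'$ and $\mathcal{A} = \mathcal{S}'$)

Let $g \in \mathcal{O}'_C(\mathbb{R})$ and $\psi \in \mathcal{S}'(\mathbb{R})$. We show that $\mathcal{W}[\psi; g] \in \mathcal{S}'(\mathbb{H})$, that is, there exists $N \in \mathbb{N}_0$ depending only
on $\psi$ and $g$ such that

$$\left| \int_{\mathbb{H}} T(\alpha, \beta) \, \mathcal{W}[\psi; g](\alpha, \beta) \, \frac{d\alpha \, d\beta}{\alpha} \right| \lesssim \sum_{s, t, k, \ell \le N} \sup_{(\alpha, \beta) \in \mathbb{H}} \left| D^{k,\ell}_{s,t} T(\alpha, \beta) \right|, \quad \forall T \in \mathcal{S}(\mathbb{H}) \tag{114}$$

where we defined

$$D^{k,\ell}_{s,t} T(\alpha, \beta) := (\alpha + 1/\alpha)^s (1 + \beta^2)^{t/2} \partial_\alpha^k \partial_\beta^\ell T(\alpha, \beta). \tag{115}$$

Fix an arbitrary $T \in \mathcal{S}(\mathbb{H})$. By the assumption that $\psi \in \mathcal{S}'(\mathbb{R})$, there exist $s, t \in \mathbb{N}_0$ such that

$$\left| \int_{\mathbb{R}} u(z) \psi(z) \, dz \right| \lesssim \sup_{z \in \mathbb{R}} \left| z^t u^{(s)}(z) \right|, \quad \forall u \in \mathcal{S}(\mathbb{R}). \tag{116}$$

Observe that for every fixed $\alpha$, $T(\alpha, \cdot) * g \in \mathcal{S}(\mathbb{R})$. Then we can provide an estimate as below.

$$\left| \int_{\mathbb{H}} T(\alpha, \beta) \int_{\mathbb{R}} g(\alpha z + \beta) \psi(z) \, dz \, \frac{d\alpha \, d\beta}{\alpha} \right| = \left| \int_0^\infty \int_{\mathbb{R}} \int_{\mathbb{R}} T(\alpha, \beta) g(\alpha z + \beta) \, d\beta \cdot \psi(z) \, dz \, \frac{d\alpha}{\alpha} \right| \tag{117}$$

$$\lesssim \int_{\mathbb{R}} \sup_{z} \left| z^t \int_{\mathbb{R}} D^{0,0}_{s,0} T(\alpha, \beta) \, g^{(s)}(\alpha z + \beta) \, d\beta \right| \frac{d\alpha}{\alpha} \tag{118}$$

$$\lesssim \int_{\mathbb{R}} \sup_{p} \left| p^t \int_{\mathbb{R}} D^{0,0}_{s+t,0} T(\alpha, \beta) \, g^{(s)}(p + \beta) \, d\beta \right| \frac{d\alpha}{\alpha} \tag{119}$$

$$\le \int_{\mathbb{R}} \int_{\mathbb{R}} \sup_{p} \left| p^t g^{(s)}(p + \beta) \right| \left| D^{0,0}_{s+t,0} T(\alpha, \beta) \right| \frac{d\beta \, d\alpha}{\alpha} \tag{120}$$

$$\lesssim \int_{\mathbb{R}} \int_{\mathbb{R}} \sup_{p} \left| (1 + |p + \beta|^2)^{t/2} g^{(s)}(p + \beta) \right| \left| D^{0,0}_{s+t,t} T(\alpha, \beta) \right| \frac{d\beta \, d\alpha}{\alpha} \tag{121}$$

$$\lesssim \int_{\mathbb{R}} \int_{\mathbb{R}} \left| D^{0,0}_{s+t,t} T(\alpha, \beta) \right| \frac{d\beta \, d\alpha}{\alpha} \tag{122}$$

$$\lesssim \sup_{(\alpha, \beta) \in \mathbb{H}} \left| D^{0,0}_{s+t+\varepsilon, \, t+\delta} T(\alpha, \beta) \right| \int_{\mathbb{H}} (\alpha + 1/\alpha)^{-\varepsilon} (1 + \beta^2)^{-\delta/2} \, \frac{d\beta \, d\alpha}{\alpha}, \tag{123}$$

where the second inequality follows by repeatedly applying $\partial_z [g(\alpha z + \beta)] = \alpha \cdot g'(\alpha z + \beta)$ and $\alpha \le \alpha + 1/\alpha$;
the third inequality follows by changing the variable $p \leftarrow \alpha z$ and applying $(\alpha + 1/\alpha)^s \cdot \alpha^{-t} \le (\alpha + 1/\alpha)^{s+t}$;
the fifth inequality follows by applying $|p| \le (1 + p^2)^{1/2}$ and Peetre’s inequality $1 + p^2 \lesssim (1 + \beta^2)(1 + |p + \beta|^2)$;
the sixth inequality follows by the assumption that $(1 + p^2)^{t/2} g^{(s)}(p)$ is bounded for any $t$; the last inequality
follows by Hölder’s inequality and the integral is convergent when $\varepsilon > 0$ and $\delta > 1$.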

Case 3e: ($\mathcal{X} = L^1$ and $\mathcal{Z} = L^p \cap C^0$ then $\mathcal{B} = L^p \cap C^0$ and $\mathcal{A} = \mathcal{S}'$)

Let $g \in L^1(\mathbb{R})$ and $\psi \in L^p \cap C^0(\mathbb{R})$. We show that $\mathcal{W}[\psi; g] \in \mathcal{S}'(\mathbb{H})$, that is, it has at most polynomial
growth at infinity. Because $\psi$ is continuous, $g * \psi$ is continuous. By Lusin’s theorem, there exists a
continuous function $g^\star$ such that $g^\star(x) = g(x)$ for almost every $x \in \mathbb{R}$; thus, by the continuity of $g * \psi$,

$$g^\star * \psi(x) = g * \psi(x), \quad \text{for every } x \in \mathbb{R}.$$

By the continuity and the integrability of $g^\star$ and $\psi$, there exist $s, t \in \mathbb{R}$ such that

$$|g^\star(x)| \lesssim (1 + x^2)^{-s/2}, \quad s > 1,$$

$$|\psi(x)| \lesssim (1 + x^2)^{-t/2}, \quad tp > 1.$$

Therefore,


g{x)ip 

Jr 


X ~ , 
a 


—dx 

a 


< 


J (1 + x 2 )~ s/2 f 1 + ^ 


x - , 

a 


-t/2 


dx 


a 


-l 


f (1 + x 2 ) s ! 2 (l + (x — /3) 2 ') da; 
Jr ' ' 

< (i + p 2 )-™A*,t)/ + y a y-y 


(1 + a 2 )*/ 2 a- 


(124) 

(125) 

(126) 

(127) 

(128) 
(129) 


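which means $\mathcal{W}[\psi; g]$ is a locally integrable function that grows at most polynomially at infinity.

Note that if $(t - 1) p < m - 1$ then $\mathcal{W}[\psi; g] \in L^p(\mathbb{H}; \alpha^{-m} d\alpha \, d\beta)$, because $|\mathcal{W}[\psi; g](\alpha, \beta)|^p$ behaves as
$\beta^{-\min(s,t)} \alpha^{(t-1)p}$ at infinity.

Case 3f: ($\mathcal{X} = \mathcal{D}'_{L^1}$ and $\mathcal{Z} = \mathcal{D}'_{L^p}$ then $\mathcal{B} = \mathcal{D}'_{L^p}$ and $\mathcal{A} = \mathcal{S}'$)

Let $g \in \mathcal{D}'_{L^1}(\mathbb{R})$ and $\psi \in \mathcal{D}'_{L^p}(\mathbb{R})$. We estimate (114). Fix an arbitrary $T \in \mathcal{S}(\mathbb{H})$. By the assumption that
$\psi \in \mathcal{D}'_{L^p}(\mathbb{R})$, for every fixed $\alpha$, $T(\alpha, \cdot) * \psi \in \mathcal{E} \cap L^p(\mathbb{R})$. Therefore, we can take $\psi^\star \in \mathcal{E} \cap L^p(\mathbb{R})$ such that
$\psi^\star = \psi$ a.e. and $T(\alpha, \cdot) * \psi^\star = T(\alpha, \cdot) * \psi$. In the same way, we can take $g^\star \in \mathcal{E} \cap L^1(\mathbb{R})$ such that $g^\star = g$
almost everywhere and $T(\alpha, \cdot) * g^\star = T(\alpha, \cdot) * g$. Therefore, this case reduces to showing that if $\mathcal{X} = L^1$ and
$\mathcal{Z} = L^p$ then $\mathcal{A} = \mathcal{S}'$. This coincides with case 3e.

Step 4: Class $\mathcal{Y}(\mathbb{Y}^{m+1})$ of $\mathscr{R}_\psi f(u, \alpha, \beta)$

The last column ($\mathcal{Y}$) is obtained by applying $\mathcal{Y}(\mathbb{Y}^{m+1}) = \mathcal{X}(\mathbb{S}^{m-1}) \,\hat{\otimes}\, \mathcal{A}(\mathbb{H})$. Recall that for $\mathbb{S}^{m-1}$, as it is
compact, $\mathcal{D} = \mathcal{S} = \mathcal{O}_M = \mathcal{E}$ and $\mathcal{E}' = \mathcal{O}'_C = \mathcal{S}' = \mathcal{D}'_{L^p} = \mathcal{D}'$. Therefore, we have $\mathcal{Y}$ as in the last column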
of Table [fj 


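B Proof of Theorem 5.4


Let $(\psi, \eta) \in \mathcal{S}(\mathbb{R}) \times \mathcal{S}'(\mathbb{R})$. Assume that $\hat{\eta}$ is singular at $0$. That is, there exists $k \in \mathbb{N}_0$ such that

$$\hat{\eta}(\zeta) = \sum_{j=0}^{k} c_j \delta^{(j)}(\zeta), \quad \zeta \in \{0\}. \tag{130}$$

Assume there exists a neighborhood $\Omega$ of $0$ such that $\hat{\eta} \in C^0(\Omega \setminus \{0\})$. Note that the continuity implies
local integrability. We show that $\psi$ and $\eta$ are admissible if and only if there exists $u \in \mathcal{O}_M(\mathbb{R})$ such that

$$\Lambda^m u = \overline{\tilde{\psi}} * \Bigl( \eta - \sum_{j=0}^{k} c_j z^j \Bigr) \quad \text{and} \quad \int_{\mathbb{R} \setminus \{0\}} \hat{u}(\zeta) \, d\zeta \neq 0. \tag{131}$$

Recall that the Fourier transform $\mathcal{O}_M(\mathbb{R}) \to \mathcal{O}'_C(\mathbb{R})$ is bijective. Thus, the action of $\hat{u}$ on the indicator
function $1_{\mathbb{R} \setminus \{0\}}(\zeta)$ is always finite.


Sufficiency:

On $\Omega \setminus \{0\}$, $\hat{\eta}$ coincides with a function. Thus the product $\overline{\hat{\psi}(\zeta)} \hat{\eta}(\zeta) |\zeta|^{-m}$ is defined in the sense of ordinary
functions, and coincides with $\hat{u}(\zeta)$. On $\mathbb{R} \setminus \Omega$, $|\zeta|^{-m}$ is in $\mathcal{O}_M(\mathbb{R} \setminus \Omega)$. Thus, the product $\overline{\hat{\psi}(\zeta)} \hat{\eta}(\zeta) |\zeta|^{-m}$
is defined in the sense of distributions, which is associative because it contains at most one tempered
distribution ($\mathcal{S} \cdot \mathcal{S}' \cdot \mathcal{O}_M$), and reduces to $\hat{u}(\zeta)$. Therefore,

$$\frac{K_{\psi,\eta}}{(2\pi)^{m-1}} = \left( \int_{\Omega \setminus \{0\}} + \int_{\mathbb{R} \setminus \Omega} \right) \hat{u}(\zeta) \, d\zeta, \tag{132}$$

which is finite by assumption.


Necessity:

Write $\Omega_0 := \Omega \cap [-1, 1]$ and $\Omega_1 := \mathbb{R} \setminus \Omega_0$. By the assumption that $\int_{\Omega_0 \setminus \{0\}} \overline{\hat{\psi}(\zeta)} \hat{\eta}(\zeta) |\zeta|^{-m} \, d\zeta$ is absolutely
convergent and $\hat{\eta}$ is continuous in $\Omega_0 \setminus \{0\}$, there exists $\varepsilon > 0$ such that

$$\left| \overline{\hat{\psi}(\zeta)} \hat{\eta}(\zeta) \right| \lesssim |\zeta|^{m - 1 + \varepsilon}, \quad \zeta \in \Omega_0 \setminus \{0\}. \tag{133}$$

Therefore, there exists $v_0 \in L^1(\mathbb{R}) \cap C^0(\mathbb{R} \setminus \{0\})$ such that its restriction to $\Omega_0 \setminus \{0\}$ coincides with

$$\overline{\hat{\psi}(\zeta)} \hat{\eta}(\zeta) = |\zeta|^m v_0(\zeta), \quad \zeta \in \Omega_0 \setminus \{0\}. \tag{134}$$

By integrability and continuity, $v_0 \in L^\infty(\mathbb{R})$. In particular, both $\lim_{\zeta \to +0} v_0(\zeta)$ and $\lim_{\zeta \to -0} v_0(\zeta)$ are
finite.

However, in $\Omega_1$, $|\zeta|^{-m} \in \mathcal{O}_M(\Omega_1)$. By the construction, $\overline{\hat{\psi}} \cdot \hat{\eta} \in \mathcal{O}'_C(\mathbb{R})$. Thus, there exists $v_1 \in \mathcal{O}'_C(\mathbb{R})$
such that

$$\overline{\hat{\psi}(\zeta)} \hat{\eta}(\zeta) = |\zeta|^m v_1(\zeta), \quad \zeta \in \Omega_1, \tag{135}$$

where the equality is in the sense of distribution.
Let

$$v := v_0 \cdot 1_{\Omega_0} + v_1 \cdot 1_{\Omega_1}. \tag{136}$$

Clearly, $v \in \mathcal{O}'_C(\mathbb{R})$ because $v_0 \cdot 1_{\Omega_0} \in \mathcal{E}'(\mathbb{R})$ and $v_1 \cdot 1_{\Omega_1} \in \mathcal{O}'_C(\mathbb{R})$. Therefore, there exists $u \in \mathcal{O}_M(\mathbb{R})$ such
that $\hat{u} = v$ and

$$\overline{\hat{\psi}(\zeta)} \hat{\eta}(\zeta) = |\zeta|^m \hat{u}(\zeta). \tag{137}$$

By the admissibility condition,

$$\int_{\mathbb{R} \setminus \{0\}} \hat{u}(\zeta) \, d\zeta = \int_{\Omega_0 \setminus \{0\}} v_0(\zeta) \, d\zeta + \int_{\Omega_1} v_1(\zeta) \, d\zeta \neq 0. \tag{138}$$

In consideration of the singularity at $0$, we have

$$\overline{\hat{\psi}(\zeta)} \Bigl( \hat{\eta}(\zeta) - \sum_{j=0}^{k} c_j \delta^{(j)}(\zeta) \Bigr) = |\zeta|^m \hat{u}(\zeta), \quad \zeta \in \mathbb{R}. \tag{139}$$

By taking the Fourier inversion in the sense of distributions,

$$\Bigl[ \overline{\tilde{\psi}} * \Bigl( \eta - \sum_{j=0}^{k} c_j z^j \Bigr) \Bigr](z) = \Lambda^m u(z), \quad z \in \mathbb{R}. \tag{140}$$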


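C Proof of Theorem 5.6


Let $f \in L^1(\mathbb{R}^m)$ satisfy $\hat{f} \in L^1(\mathbb{R}^m)$ and $(\psi, \eta) \in \mathcal{S}(\mathbb{R}) \times \mathcal{S}'_0(\mathbb{R})$ be admissible. For simplicity, we rescale $\psi$
to satisfy $K_{\psi,\eta} = 1$. Write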


We show that 


J(x;e, 5):= (u, a, u • x - az) rj(z) 

Js m ~ 1 Je Jr 


dzdaclu 


(141) 


lim /(x; e, S) = /(x), a.e. x G ! 
5—> oo 
£—>0 


(142) 


and the equality holds at every continuous point of /. 

By using the Fourier slice theorem in the sense of distribution, 


J (u, a,/3 - az) ^(z)dz = J f(u}u)$(au:)rj(auj)e lull3 duj 

■ f f(uju)u(auj)\au}\ rn e lul ^di 
JR\fot 


1 

2-7T 


where |C| m 2(C) := (( + 0) is defined as in Theorem 

Then, 


5.4 


f f f /( wu ) u(au>) |a;| m e* w/3 dadw 

Je - a 2?r Jm\{ 0} Je 


i f°° r 

2tt Jo J r 




u(C)/(wu)e’"P|w| m - 1 dCdw 


0 


M(C)/(sgn(C)ru) exp(sgn(C)ir/3)r m dCdr, 


(143) 

(144) 


(145) 

(146) 

(147) 


where the second equation follows by changing the variable £ <— aco with a m ~ 1 da = |w| m-1 |C| -m d((; the 
third equation follows by changing the variable r <— |cj| with sgnw = sgn^. In the following, we substitute 
fj <— u • x. Observe that in du, 


/(—ru) exp (—iru • x)du = f(r u) exp(i?’u ■ x)du; 

Jg™-i Js m — 1 


( 148 ) 


hence, we can omit $\operatorname{sgn} \zeta$.
Then, by substituting $\beta \leftarrow u \cdot x$ and changing the variable $\xi \leftarrow r u$,


J(x;£, 8) = f (|147h du 

J S'"- 1 


1 

2n 

1 

27T 


i r fj 

I [f 


U 


(CMC 


re^\C\^r8 

MCMC 


||e|| e <K|«||€||i 


/(ru)e* ru x r m_ 1 drdu, 


/(Oe'« x dC<iC. 


(149) 

(150) 

(151) 


Recall that u G 0' c (R); thus, its action is continuous. That is, the limits and the integral commute. 
Therefore, 


^/(x) 


lim /(x; e, <5) 

<5—>oo 
£— >0 


r 

1 

<5" 

Jr™ 

JR\{0} 


1 

(2tt) 

1 

(2tt ) r 
/(x), a.e. x g 


/(C)e l « x d| 


f /(€)« 

Jr™ 




dC 


(152) 

(153) 

(154) 

(155) 


where the last equation follows by the Fourier inversion formula, a consequence of which the equality holds 
at Xq if / is continuous at Xq. 


D Proof of Theorem 5.7 


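Let $f \in L^1(\mathbb{R}^m)$ and $(\psi, \eta) \in \mathcal{S}(\mathbb{R}) \times \mathcal{S}'(\mathbb{R})$. Assume that there exists $u \in \mathcal{E} \cap L^1(\mathbb{R})$ that is real-valued and satisfies
$\Lambda^m u = \overline{\tilde{\psi}} * \eta$ and $\int_{\mathbb{R}} \hat{u}(\zeta) \, d\zeta = -1$. Write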

f f S f < 

/(x; £, 6) := (u,a,u ■ x - az) r/(z)~ 

J§ m ~i Je Jr 


clzdadu 


a“ 


We show that 


lim /(x; £, 6) = R*A x R(x), a.e. xg 
(5—►00 
£—>0 


In the following we write (-) Q (p) = (• )(p/a)/a. By using the convolution form, 

J (u, a, /3 - az) 7?(z)dz = |r/(u, ■) * (^ * ??) J (/?) 


= [R/( u , ■) * (A m u)cJ (/3). 


Observe that 


]>-.).<■.)£-a- 1 {,(a»)( 5) q . 


da 


= A” 


= A” 


J ’P/e 

(Au)(z)dz 

v/8 


V Jp/s 


1 n, (P\ 1 nj (P 

- -Hu 


-Hu 

P \£J p 

= h rn - 1 [k e [p) - k s (p )], 


(156) 


(157) 

(158) 

(159) 

(160) 
(161) 

(162) 

(163) 


where the first equality follows by repeatedly applying $(\Lambda u)_\alpha = \alpha \Lambda(u_\alpha)$; the second equality follows by
substituting $z \leftarrow p/\alpha$; the fourth equality follows by defining

$$k(z) := \frac{1}{z} \mathcal{H} u(z) \quad \text{and} \quad k_\gamma(p) := \frac{1}{\gamma} k\!\left( \frac{p}{\gamma} \right) \quad \text{for } \gamma = \varepsilon, \delta. \tag{164}$$

Therefore, we have


f'O l 

dl59|^ = [R/(u, •) * fll63l)] (/J) (165) 

J £ & 

= [A m “ 1 R/(u, •) * (k e - fc«)] (/3). (166) 

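We show that $k \in L^1 \cap L^\infty(\mathbb{R})$ and $\int_{\mathbb{R}} k(z)\,\mathrm{d}z = 1$. To begin with, $k \in L^1(\mathbb{R})$ because there exist $s, t > 0$ such that

$$
\begin{aligned}
|k(z)| &\lesssim |z|^{-1+s}, \quad \text{as } |z| \to 0, && \text{(167)}\\
|k(z)| &\lesssim |z|^{-1-t}, \quad \text{as } |z| \to \infty. && \text{(168)}
\end{aligned}
$$

The first claim holds because $u$ is real-valued and thus $\widehat{u}$ is odd; then

$$
\begin{aligned}
Hu(0) &= \int_{\mathbb{R}} \operatorname{sgn}\zeta \cdot \widehat{u}(\zeta)\,\mathrm{d}\zeta && \text{(169)}\\
&= \int_{(0,\infty)} \widehat{u}(\zeta)\,\mathrm{d}\zeta - \int_{(-\infty,0]} \widehat{u}(\zeta)\,\mathrm{d}\zeta && \text{(170)}\\
&= 0. && \text{(171)}
\end{aligned}
$$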


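The second claim holds because $u \in L^1(\mathbb{R})$ and thus $\widehat{u}$, as well as $Hu$, decays at infinity. Then, by the continuity and the integrability of $k$, it is bounded. By the assumption that $\int_{\mathbb{R}} \widehat{u}(\zeta)\,\mathrm{d}\zeta = -1$,

$$
\begin{aligned}
\int_{\mathbb{R}} k(z)\,\mathrm{d}z
&= -\int_{\mathbb{R}} \frac{Hu(z)}{0 - z}\,\mathrm{d}z && \text{(172)}\\
&= -\int_{\mathbb{R}} \widehat{u}(\zeta)\,\mathrm{d}\zeta && \text{(173)}\\
&= 1. && \text{(174)}
\end{aligned}
$$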


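Write

$$
J(u,p) := \Lambda^{m-1}\mathrm{R}f(u,p). \qquad \text{(175)}
$$

Because $k \in L^1(\mathbb{R})$ and $\int_{\mathbb{R}} k(z)\,\mathrm{d}z = 1$, $k_\varepsilon$ is an approximation of the identity [43, III, Th.2]. Then,

$$
\lim_{\varepsilon\to 0} J(u,\cdot) * k_{\varepsilon}(p) = J(u,p), \quad \text{a.e. } (u,p) \in \mathbb{S}^{m-1}\times\mathbb{R}. \qquad \text{(176)}
$$

However, as $k \in L^\infty(\mathbb{R})$,

$$
\|J * k_{\delta}\|_{L^{\infty}(\mathbb{S}^{m-1}\times\mathbb{R})} \le \delta^{-1}\,\|J\|_{L^{1}(\mathbb{S}^{m-1}\times\mathbb{R})}\,\|k\|_{L^{\infty}(\mathbb{R})}, \qquad \text{(177)}
$$

and thus,

$$
\lim_{\delta\to\infty} J(u,\cdot) * k_{\delta}(p) = 0, \quad \text{a.e. } (u,p) \in \mathbb{S}^{m-1}\times\mathbb{R}. \qquad \text{(178)}
$$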
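
The two limits in (176) and (178) can be illustrated numerically. The sketch below is a minimal illustration under loud assumptions: the kernel `k` is a standard normal density standing in for $k(z) = Hu(z)/z$, and the profile `J` is a Gaussian standing in for $J(u,\cdot)$ at a fixed direction; both are hypothetical choices that merely satisfy the properties used above (unit integral, boundedness, integrability).

```python
import math


def k(z):
    # Stand-in kernel with unit integral (standard normal density).
    # Hypothetical: not the kernel k(z) = Hu(z) / z of the proof.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def k_scaled(p, gamma):
    # Dilation k_gamma(p) = k(p / gamma) / gamma, as in (164).
    return k(p / gamma) / gamma


def J(p):
    # Stand-in integrable profile (hypothetical).
    return math.exp(-p * p)


def convolve_at(p, gamma, half_width=10.0, n=40000):
    # Midpoint-rule approximation of
    # (J * k_gamma)(p) = integral of J(q) * k_gamma(p - q) dq.
    h = 2.0 * half_width / n
    total = 0.0
    for j in range(n):
        q = -half_width + (j + 0.5) * h
        total += J(q) * k_scaled(p - q, gamma)
    return total * h


if __name__ == "__main__":
    p = 0.5
    print("small scale:", convolve_at(p, 0.01), "target J(p):", J(p))
    print("large scale:", convolve_at(p, 1000.0), "target: 0")
```

With a small scale the convolution reproduces `J(p)`, while with a large scale it is dominated by $\delta^{-1}\|J\|_{L^1}\|k\|_{L^\infty}$, which is the mechanism behind (177).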


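Because it is an approximation to the identity, $J * k_\gamma \in L^1(\mathbb{S}^{m-1}\times\mathbb{R})$ for $\gamma > 0$. Hence, there exists a maximal function $M(u,p)$ [43, III, Th.2] such that

$$
\sup_{0<\varepsilon} \left|\left(J(u,\cdot) * k_{\varepsilon}\right)(p)\right| \le M(u,p). \qquad \text{(179)}
$$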


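Therefore, $|J(u,\cdot) * (k_\varepsilon - k_\delta)(u\cdot x)|$ is uniformly integrable [35, Ex. 4.15.4] on $\mathbb{S}^{m-1}$. That is, if $\Omega \subset \mathbb{S}^{m-1}$ satisfies $\int_{\Omega}\mathrm{d}u \le A$, then

$$
\int_{\Omega} \left|J(u,\cdot) * k_{\gamma}\right|(u\cdot x)\,\mathrm{d}u \le A \sup_{u,p}|M(u,p)|, \quad \forall \gamma \ge 0. \qquad \text{(180)}
$$

Thus, by the Vitali convergence theorem, we have

$$
\begin{aligned}
\mathscr{R}^{\dagger}_{\eta}\mathscr{R}_{\psi} f(x)
&= \lim_{\substack{\delta\to\infty\\ \varepsilon\to 0}} \int_{\mathbb{S}^{m-1}} \left[J(u,\cdot) * (k_{\varepsilon} - k_{\delta})\right](u\cdot x)\,\mathrm{d}u && \text{(181)}\\
&= \int_{\mathbb{S}^{m-1}} J(u, u\cdot x)\,\mathrm{d}u, \quad \text{a.e. } x \in \mathbb{R}^{m} && \text{(182)}\\
&= \mathrm{R}^{*}\Lambda^{m-1}\mathrm{R}f(x). && \text{(183)}
\end{aligned}
$$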
E Proof of Theorem 5.11


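Let $f \in L^2(\mathbb{R}^m)$ and $(\psi, \eta)$ be admissible with $K_{\psi,\eta} = 1$. Assume without loss of generality that $(\psi,\psi)$ and $(\eta,\eta)$ are each self-admissible. Write

$$
I[f;(\varepsilon,\delta)](x) := \int_{\mathbb{S}^{m-1}}\int_{\varepsilon}^{\delta}\int_{\mathbb{R}} \mathscr{R}_{\psi} f(u,\alpha,u\cdot x - \alpha z)\,\eta(z)\,\frac{\mathrm{d}z\,\mathrm{d}\alpha\,\mathrm{d}u}{\alpha^{m}}. \qquad \text{(184)}
$$

In the following we write $\Omega[\varepsilon,\delta] := \mathbb{S}^{m-1} \times [\mathbb{R}_{+}\setminus(\varepsilon,\delta)] \times \mathbb{R} \subset \mathbb{Y}^{m+1}$. We show that

$$
\lim_{\substack{\delta\to\infty\\ \varepsilon\to 0}} \left\| f - I[f;(\varepsilon,\delta)] \right\|_{2} = 0. \qquad \text{(185)}
$$

Observe that

$$
\begin{aligned}
\left\| f - I[f;(\varepsilon,\delta)] \right\|_{2}
&= \sup_{\|g\|_{2}=1} \left|\left\langle f - I[f;(\varepsilon,\delta)],\, g\right\rangle\right| && \text{(186)}\\
&= \sup_{\|g\|_{2}=1} \left|\left\langle \mathscr{R}_{\psi} f,\, \mathscr{R}_{\eta} g\right\rangle_{L^{2}(\Omega[\varepsilon,\delta])}\right| && \text{(187)}\\
&\le \sup_{\|g\|_{2}=1} \left\|\mathscr{R}_{\psi} f\right\|_{L^{2}(\Omega[\varepsilon,\delta])}\left\|\mathscr{R}_{\eta} g\right\|_{L^{2}(\mathbb{Y}^{m+1})} && \text{(188)}\\
&= \sup_{\|g\|_{2}=1} \left\|\mathscr{R}_{\psi} f\right\|_{L^{2}(\Omega[\varepsilon,\delta])}\left\|g\right\|_{2} && \text{(189)}\\
&\to 0 \cdot 1, \quad \text{as } \varepsilon \to 0,\ \delta \to \infty, && \text{(190)}
\end{aligned}
$$

where the inequality in (188) follows from the Schwarz inequality, and the last limit holds because $\|\mathscr{R}_{\psi} f\|_{L^{2}(\Omega[\varepsilon,\delta])}$ shrinks as the domain $\Omega[\varepsilon,\delta]$ tends to $\emptyset$.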


F Proofs of Example 6.4 and Example 6.10

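Let $\sigma(z) := (1 + e^{-z})^{-1}$. Obviously $\sigma(z) \in \mathcal{E}(\mathbb{R})$.

Step 0: Derivatives of $\sigma(z)$.

For every $k \in \mathbb{N}$,

$$
\sigma^{(k)}(z) = S_{k}(\sigma(z)), \qquad \text{(191)}
$$

where $S_k(z)$ is a polynomial defined by

$$
S_{k}(z) :=
\begin{cases}
z(1-z), & k = 1,\\
S'_{k-1}(z)\,S_{1}(z), & k > 1,
\end{cases}
\qquad \text{(192)}
$$

which is justified by induction on $k$.

Step 1: $\sigma, \tanh \in \mathcal{O}_M(\mathbb{R})$.

Recall that $|\sigma(z)| \le 1$. Hence, for every $k \in \mathbb{N}$,

$$
|\sigma^{(k)}(z)| = |S_{k}(\sigma(z))| \le \max_{z\in[0,1]} |S_{k}(z)| < \infty. \qquad \text{(193)}
$$

Therefore, for every $k \in \mathbb{N}_0$, $\sigma^{(k)}(z)$ is bounded, which implies $\sigma(z) \in \mathcal{O}_M(\mathbb{R})$. It follows immediately that $\tanh \in \mathcal{O}_M(\mathbb{R})$, because

$$
\tanh(z) = 2\sigma(2z) - 1. \qquad \text{(194)}
$$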
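
The recursion for $S_k$ and the resulting bound are easy to check by machine. The following is a minimal sketch, not the authors' code: polynomials are stored as coefficient lists (lowest degree first), and the helper names `poly_mul`, `poly_diff`, `poly_eval`, `sigmoid_poly`, and `sup_on_unit_interval` are hypothetical names introduced here for illustration.

```python
import math


def poly_mul(p, q):
    # Product of two polynomials given as coefficient lists (lowest degree first).
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_diff(p):
    # Derivative of a polynomial given as a coefficient list.
    return [i * c for i, c in enumerate(p)][1:]


def poly_eval(p, z):
    # Horner evaluation of the polynomial p at z.
    acc = 0.0
    for c in reversed(p):
        acc = acc * z + c
    return acc


S1 = [0, 1, -1]  # S_1(z) = z * (1 - z)


def sigmoid_poly(k):
    # Coefficients of S_k, built by the recursion S_k = S'_{k-1} * S_1.
    s = S1
    for _ in range(k - 1):
        s = poly_mul(poly_diff(s), S1)
    return s


def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))


def sup_on_unit_interval(p, n=10001):
    # Grid approximation of the maximum of |p| on [0, 1], the bound used in Step 1.
    return max(abs(poly_eval(p, i / (n - 1))) for i in range(n))


if __name__ == "__main__":
    for k in (1, 2, 3):
        s_k = sigmoid_poly(k)
        print(k, s_k, sup_on_unit_interval(s_k))
    print(math.tanh(0.3), 2.0 * sigma(0.6) - 1.0)
```

Evaluating `poly_eval(sigmoid_poly(k), sigma(z))` gives $\sigma^{(k)}(z)$ without symbolic differentiation, and the printed maxima are the finite constants that bound each derivative.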


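Step 2: $\sigma^{(k)} \in \mathcal{S}(\mathbb{R})$, $k \in \mathbb{N}$.

Observe that

$$
\sigma'(z) = \left(e^{z/2} + e^{-z/2}\right)^{-2}. \qquad \text{(195)}
$$

Hence, $\sigma'(z)$ decays faster than any polynomial, which means $\sup_z |z^{\ell}\sigma'(z)| < \infty$ for any $\ell \in \mathbb{N}_0$. Then, for every $k, \ell \in \mathbb{N}_0$,

$$
\sup_{z} \left|z^{\ell}\sigma^{(k+1)}(z)\right| = \sup_{z} \left|z^{\ell}S_{k+1}(\sigma(z))\right| \le \max_{z}\left|z^{\ell}\sigma'(z)\right| \cdot \max_{z}\left|S'_{k}(\sigma(z))\right| < \infty, \qquad \text{(196)}
$$

which implies $\sigma' \in \mathcal{S}(\mathbb{R})$. Therefore, $\sigma^{(k)} \in \mathcal{S}(\mathbb{R})$ for every $k \in \mathbb{N}$.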
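
The closed form (195) and the rapid decay it implies can be checked directly. This is a small sketch under assumptions: the grid, its half-width, and the function names are illustrative choices made here, and a finite grid can only suggest, not prove, that the weighted supremum is finite.

```python
import math


def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigma_prime(z):
    # First derivative via S_1: sigma' = sigma * (1 - sigma).
    s = sigma(z)
    return s * (1.0 - s)


def sigma_prime_closed(z):
    # Closed form (195): (e^{z/2} + e^{-z/2})^{-2}.
    return (math.exp(0.5 * z) + math.exp(-0.5 * z)) ** -2


def weighted_sup(ell, half_width=50.0, n=10001):
    # Grid approximation of sup_z |z^ell * sigma'(z)| on [-half_width, half_width].
    best = 0.0
    for i in range(n):
        z = -half_width + 2.0 * half_width * i / (n - 1)
        best = max(best, abs(z) ** ell * sigma_prime_closed(z))
    return best


if __name__ == "__main__":
    print(sigma_prime(1.3), sigma_prime_closed(1.3))
    for ell in (0, 1, 3, 5):
        print(ell, weighted_sup(ell))
```

The weighted values stay bounded for each power tested because $\sigma'(z)$ behaves like $e^{-|z|}$ for large $|z|$, which outpaces any polynomial weight.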

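Step 3: $\sigma^{(-1)} \in \mathcal{O}_M(\mathbb{R})$.

Observe that

$$
\sigma^{(-1)}(z) = \int_{-\infty}^{z} \sigma(w)\,\mathrm{d}w = \log(1 + e^{z}). \qquad \text{(197)}
$$

Hence, it is already known that $[\sigma^{(-1)}]^{(k)} = \sigma^{(k-1)} \in \mathcal{O}_M(\mathbb{R})$ for every $k \in \mathbb{N}$. We show that $\sigma^{(-1)}(z)$ has at most polynomial growth. Write

$$
\rho(z) := \sigma^{(-1)}(z) - z_{+}. \qquad \text{(198)}
$$

Then $\rho(z)$ attains its maximum $\max_z \rho(z) = \log 2$ at $0$, because $\rho'(z) < 0$ when $z > 0$ and $\rho'(z) > 0$ when $z < 0$. Therefore,

$$
\left|\sigma^{(-1)}(z)\right| \le |\rho(z)| + |z_{+}| \le \log 2 + |z|, \qquad \text{(199)}
$$

which implies $\sigma^{(-1)}(z) \in \mathcal{O}_M(\mathbb{R})$.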
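
The growth bound (199) can be spot-checked numerically. The sketch below assumes the softplus normalization $\sigma^{(-1)}(z) = \log(1 + e^{z})$, which is the one for which $\rho(0) = \log 2$; the function names and the sampling grid are illustrative choices made here.

```python
import math


def softplus(z):
    # Numerically stable log(1 + e^z), written as z_+ + log(1 + e^{-|z|}).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def rho(z):
    # rho(z) = softplus(z) - z_+, as in (198).
    return softplus(z) - max(z, 0.0)


if __name__ == "__main__":
    grid = [-20.0 + 0.01 * i for i in range(4001)]
    print("max of rho on grid:", max(rho(z) for z in grid))
    print("log 2:", math.log(2.0))
    worst = max(softplus(z) - (math.log(2.0) + abs(z)) for z in grid)
    print("largest violation of (199):", worst)
```

On the sampled grid the maximum of `rho` matches $\log 2$ up to rounding and the reported largest violation is non-positive, in agreement with the linear growth bound.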

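Step 4: $\eta = \sigma^{(k)}$ is admissible with $\psi = \Lambda^m G$ when $k \in \mathbb{N}$ is positive and odd.

Recall that $\eta = \sigma^{(k)} \in \mathcal{S}(\mathbb{R})$. Hence, $\langle \eta, \psi_0 \rangle = \langle \widehat{\eta}, \widehat{\psi_0} \rangle$. Observe that if $k$ is odd, then $\sigma^{(k)}$ is an even function and thus $\langle \eta, \psi_0 \rangle \neq 0$. However, if $k$ is even, then $\sigma^{(k)}$ is an odd function and thus $\langle \eta, \psi_0 \rangle = 0$.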
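
The parity argument can be illustrated by evaluating the pairing of $\sigma^{(k)}$ with a Gaussian for $k = 1, 2$. This is a hedged sketch: it assumes $G(z) = e^{-z^2/2}$ plays the role of $\psi_0$, uses a plain midpoint rule on a symmetric grid, and covers only the first two derivatives; none of the names below come from the paper.

```python
import math


def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))


def gaussian(z):
    # Assumed form of G (illustrative normalization).
    return math.exp(-0.5 * z * z)


def sigma_d1(z):
    # sigma' = S_1(sigma) = sigma * (1 - sigma).
    s = sigma(z)
    return s * (1.0 - s)


def sigma_d2(z):
    # sigma'' = S_2(sigma) = sigma * (1 - sigma) * (1 - 2 * sigma).
    s = sigma(z)
    return s * (1.0 - s) * (1.0 - 2.0 * s)


def pairing(eta, half_width=12.0, n=24000):
    # Midpoint-rule approximation of the integral of eta(z) * G(z) over the real line.
    h = 2.0 * half_width / n
    total = 0.0
    for j in range(n):
        z = -half_width + (j + 0.5) * h
        total += eta(z) * gaussian(z)
    return total * h


if __name__ == "__main__":
    print("k = 1:", pairing(sigma_d1))
    print("k = 2:", pairing(sigma_d2))
```

For $k = 1$ the integrand is positive everywhere, so the pairing is clearly nonzero; for $k = 2$ the integrand is odd and the pairing vanishes up to rounding.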


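Step 5: $\sigma$ and $\sigma^{(-1)}$ cannot be admissible with $\psi = \Lambda^m G$.

This follows by Theorem 5.4 because both

$$
\int_{\mathbb{R}} \left(G * \sigma\right)(z)\,\mathrm{d}z \quad\text{and}\quad \int_{\mathbb{R}} \left(G * \sigma^{(-1)}\right)(z)\,\mathrm{d}z
$$

diverge.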

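Step 6: $\sigma$ and $\sigma^{(-1)}$ are admissible with $\psi = \Lambda^m G'$ and $\psi = \Lambda^m G''$, respectively.

Observe that both

$$
\begin{aligned}
u_{0} &:= G' * \sigma = G * \sigma' && \text{(200)}\\
u_{-1} &:= G'' * \sigma^{(-1)} = G * \sigma' && \text{(201)}
\end{aligned}
$$

belong to $\mathcal{S}(\mathbb{R})$. Hence, $u_0$ and $u_{-1}$ satisfy the sufficient condition in Theorem 5.4.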